Question

I have a pandas time-series DataFrame with dates as the index and several columns (one of which is cusip).

I want to iterate over the DataFrame and build a new DataFrame that holds, for each cusip, the most recent row.

I tried using groupby:

newData = []
for group in df.groupby(df['CUSIP']):
    newData.append(group[group.index == max(group.index)])

but it fails with:

TypeError: 'builtin_function_or_method' object is not iterable


In [374]: df.head()
Out[374]: 
              CUSIP        COLA         COLB       COLC  
date                                                          
1992-05-08    AAA          238         4256      3.523346   
1992-07-13    AAA          234         4677      3.485577   
1992-12-12    BBB          221         5150      3.24
1995-12-12    BBB          254         5150      3.25
1997-12-12    BBB          245         6150      3.25
1998-12-12    CCC          234         5140      3.24145
1999-12-12    CCC          223         5120      3.65145

I want:

              CUSIP        COLA         COLB       COLC  
date           
1992-07-13    AAA          234         4677      3.485577      
1997-12-12    BBB          245         6150      3.25
1999-12-12    CCC          223         5120      3.65145

Should I take a different approach? Thanks.

Answer 1

In [17]: df
Out[17]: 
           cusip    a     b         c
date                                 
1992-05-08   AAA  238  4256  3.523346
1992-07-13   AAA  234  4677  3.485577
1992-12-12   BBB  221  5150  3.240000
1995-12-12   BBB  254  5150  3.250000
1997-12-12   BBB  245  6150  3.250000
1998-12-12   CCC  234  5140  3.241450
1999-12-12   CCC  223  5120  3.651450

[7 rows x 4 columns]

Sort by the date index:

In [18]: df = df.sort_index()

In [19]: df
Out[19]: 
           cusip    a     b         c
date                                 
1992-05-08   AAA  238  4256  3.523346
1992-07-13   AAA  234  4677  3.485577
1992-12-12   BBB  221  5150  3.240000
1995-12-12   BBB  254  5150  3.250000
1997-12-12   BBB  245  6150  3.250000
1998-12-12   CCC  234  5140  3.241450
1999-12-12   CCC  223  5120  3.651450

[7 rows x 4 columns]

Take the last element from each group:

In [20]: df.groupby('cusip').last()
Out[20]: 
         a     b         c
cusip                     
AAA    234  4677  3.485577
BBB    245  6150  3.250000
CCC    223  5120  3.651450

[3 rows x 3 columns]
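One detail that may matter here: last() returns the last non-null value of each column within a group, so if the newest row has a missing value, the result can combine values from different dates. A minimal sketch with made-up data (the NaN in column b is an assumption added for illustration, it is not in the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the newest AAA row is missing a value in column b.
df = pd.DataFrame(
    {"cusip": ["AAA", "AAA"], "a": [238, 234], "b": [4256.0, np.nan]},
    index=pd.to_datetime(["1992-05-08", "1992-07-13"]),
)
df.index.name = "date"

last_values = df.sort_index().groupby("cusip").last()

# Column a comes from 1992-07-13, but column b falls back to the
# earlier non-null value from 1992-05-08.
print(last_values)
```

If every column is fully populated, as in the question, this makes no difference.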

If you want to keep the date index, reset it first, group, and then set the index again:

In [9]: df.reset_index().groupby('cusip').last().reset_index().set_index('date')
Out[9]: 
           cusip    a     b         c
date                                 
1992-07-13   AAA  234  4677  3.485577
1997-12-12   BBB  245  6150  3.250000
1999-12-12   CCC  223  5120  3.651450

[3 rows x 4 columns]
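As for the error in the question: iterating over a groupby yields (key, group) tuples, so group.index in the original loop is the tuple's index method rather than a DatetimeIndex, and max() cannot iterate over it. A sketch of two ways around this, rebuilding the sample frame by hand; the names latest, pieces and latest_loop are just illustrative:

```python
import pandas as pd

# Sample data copied from the frame shown above.
df = pd.DataFrame(
    {
        "cusip": ["AAA", "AAA", "BBB", "BBB", "BBB", "CCC", "CCC"],
        "a": [238, 234, 221, 254, 245, 234, 223],
        "b": [4256, 4677, 5150, 5150, 6150, 5140, 5120],
        "c": [3.523346, 3.485577, 3.24, 3.25, 3.25, 3.24145, 3.65145],
    },
    index=pd.to_datetime(
        ["1992-05-08", "1992-07-13", "1992-12-12", "1995-12-12",
         "1997-12-12", "1998-12-12", "1999-12-12"]
    ),
)
df.index.name = "date"

# Loop-free: keep the last row of each group; the date index is untouched,
# so no reset_index/set_index round trip is needed.
latest = df.sort_index().groupby("cusip").tail(1)

# Corrected version of the original loop: unpack the (key, group) tuple.
pieces = []
for key, group in df.groupby("cusip"):
    pieces.append(group[group.index == group.index.max()])
latest_loop = pd.concat(pieces)

print(latest)
```

Unlike last(), tail(1) returns whole rows, so missing values in the newest row are kept as they are.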

Answer 2

Here is how I did it:

import pandas as pd

df = pd.read_csv('/home/desktop/test.csv')

Convert the date to datetime:

df = df.reset_index()
df['date'] = pd.to_datetime(df['date'])

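Sort the DataFrame the way you want it (DataFrame.sort has been removed from pandas; sort_values is its replacement), then group:

df = df.sort_values(['CUSIP','date'], ascending=[True,False]).groupby('CUSIP')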

Define what happens during aggregation (this depends on how you sorted):

def return_first(pd_series):
    return pd_series.values[0]

Build a dict that applies the same function to every column:

agg_dict = {c: return_first for c in df.columns}

Final aggregation:

df = df.agg(agg_dict)
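A more compact variant of the same idea (sort each CUSIP with the newest date first, then keep the first row) might look like the sketch below. The inline CSV text is a stand-in for test.csv built from the question's data, and using drop_duplicates in place of the custom aggregation is a substitution on my part, not the method above:

```python
import io

import pandas as pd

# Stand-in for test.csv, using the rows from the question.
csv_text = """date,CUSIP,COLA,COLB,COLC
1992-05-08,AAA,238,4256,3.523346
1992-07-13,AAA,234,4677,3.485577
1992-12-12,BBB,221,5150,3.24
1995-12-12,BBB,254,5150,3.25
1997-12-12,BBB,245,6150,3.25
1998-12-12,CCC,234,5140,3.24145
1999-12-12,CCC,223,5120,3.65145
"""

# parse_dates converts the date column to datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

latest = (
    df.sort_values(["CUSIP", "date"], ascending=[True, False])
      .drop_duplicates("CUSIP", keep="first")  # first row per CUSIP = newest date
      .set_index("date")
)

print(latest)
```

This keeps whole rows, so there is no need to build an aggregation dict per column.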

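Edit: converting the date to datetime avoids errors like the one below, where the string dates sort in the wrong order ('1997-12-4' lands ahead of '1997-12-05'):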

In [12]: df.sort_values(['CUSIP','date'],ascending=[True,False])
Out[12]: 
         date CUSIP  COLA  COLB      COLC           date_time

6  1999-12-12   CCC   223  5120  3.651450 1999-12-12 00:00:00
5  1998-12-12   CCC   234  5140  3.241450 1998-12-12 00:00:00
8   1997-12-4   DDD   999  9999  9.999999 1997-12-04 00:00:00
9  1997-12-05   DDD   245  6150  3.250000 1997-12-05 00:00:00
7   1992-07-6   DDD   234  4677  3.485577 1992-07-06 00:00:00
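The misordering above comes from comparing the dates as strings, character by character. A small sketch reproducing it with two of the DDD dates (the frame is cut down by hand for illustration):

```python
import pandas as pd

df = pd.DataFrame({"date": ["1997-12-4", "1997-12-05"], "CUSIP": ["DDD", "DDD"]})

# As strings, '1997-12-4' sorts above '1997-12-05' because '4' > '0',
# so the earlier day wrongly comes first in a descending sort.
by_string = df.sort_values("date", ascending=False)

# After conversion the comparison is chronological.
df["date_time"] = pd.to_datetime(df["date"])
by_datetime = df.sort_values("date_time", ascending=False)

print(by_string)
print(by_datetime)
```

Depending on the pandas version, to_datetime may warn that it could not infer a single format for the unpadded day; the parsed values should be the same either way.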

Groupby to select the most recent event in a pandas time-series DataFrame
