Selecting a subset of a Pandas DataFrame

Time: 2014-09-27 13:05:26

Tags: python pandas

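I have two different pandas DataFrames, and I want to extract data from one DataFrame whenever the other DataFrame has a specific value on the same date. Specifically, I have a DataFrame named "GDP" that looks like this: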

               GDP
DATE               
1947-01-01    243.1
1947-04-01    246.3
1947-07-01    250.1

I also have a DataFrame named "Recession" that contains the following data:

            USRECQ
DATE         
1949-07-01       1
1949-10-01       1
1950-01-01       0

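I want to create two new time series. One should contain the GDP data whenever USRECQ has a value of 0 on the same date. The other should contain the GDP data whenever USRECQ has a value of 1 on the same date. How can I do this?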

1 Answer:

Answer 0: (score: 4)

Let's modify the example you posted so that the dates overlap:

import pandas as pd
import numpy as np
GDP = pd.DataFrame({'GDP':np.arange(10)*10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))

#             GDP
# 2000-01-01    0
# 2000-01-02   10
# 2000-01-03   20
# 2000-01-04   30
# 2000-01-05   40
# 2000-01-06   50
# 2000-01-07   60
# 2000-01-08   70
# 2000-01-09   80
# 2000-01-10   90

recession = pd.DataFrame({'USRECQ': [0]*5+[1]*5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
#             USRECQ
# 2000-01-02       0
# 2000-01-03       0
# 2000-01-04       0
# 2000-01-05       0
# 2000-01-06       0
# 2000-01-07       1
# 2000-01-08       1
# 2000-01-09       1
# 2000-01-10       1
# 2000-01-11       1

Then you can join the two DataFrames on their date index:

combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
#             GDP  USRECQ
# 2000-01-01    0     NaN
# 2000-01-02   10       0
# 2000-01-03   20       0
# 2000-01-04   30       0
# 2000-01-05   40       0
# 2000-01-06   50       0
# 2000-01-07   60       1
# 2000-01-08   70       1
# 2000-01-09   80       1
# 2000-01-10   90       1
# 2000-01-11  NaN       1
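
The how='inner' variant mentioned in the comment on the join can be sketched as follows. This is a minimal illustration that rebuilds the same toy frames; the name `inner` is only chosen for this example. An inner join keeps just the dates present in both indexes, so no NaN rows appear.

```python
import numpy as np
import pandas as pd

# Same toy data as in the answer
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))

# Keep only the dates that occur in both frames
# ('inner' is an illustrative name, not part of the original answer)
inner = GDP.join(recession, how='inner')

print(inner.shape)                 # rows and columns after the join
print(inner.isna().any().any())    # whether any NaN is left
```

Because the two date ranges overlap on nine days (2000-01-02 through 2000-01-10), the first and last rows of the outer join are dropped here.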

And select rows based on a condition:

In [112]: combined.loc[combined['USRECQ']==0]
Out[112]: 
            GDP  USRECQ
2000-01-02   10       0
2000-01-03   20       0
2000-01-04   30       0
2000-01-05   40       0
2000-01-06   50       0

In [113]: combined.loc[combined['USRECQ']==1]
Out[113]: 
            GDP  USRECQ
2000-01-07   60       1
2000-01-08   70       1
2000-01-09   80       1
2000-01-10   90       1
2000-01-11  NaN       1

To get just the GDP column, pass the column name as the second item to combined.loc:

In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]: 
2000-01-07    60
2000-01-08    70
2000-01-09    80
2000-01-10    90
2000-01-11   NaN
Freq: D, Name: GDP, dtype: float64
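
Putting the pieces together, the two time series asked for in the question could be built roughly like this. It is a sketch on the toy data, assuming an inner join is acceptable so that dates missing from either frame are dropped; `gdp_expansion` and `gdp_recession` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Same toy data as in the answer
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))

# Inner join: only dates that have both a GDP value and a USRECQ flag
combined = GDP.join(recession, how='inner')

# GDP on dates where USRECQ == 0, and GDP on dates where USRECQ == 1
gdp_expansion = combined.loc[combined['USRECQ'] == 0, 'GDP']
gdp_recession = combined.loc[combined['USRECQ'] == 1, 'GDP']

print(gdp_expansion)
print(gdp_recession)
```

With the real data, the same two `.loc` selections should work unchanged as long as both frames are indexed by DATE.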

As PaulH points out, you can also use query, which has a nicer syntax:

In [118]: combined.query('USRECQ==1')
Out[118]: 
            GDP  USRECQ
2000-01-07   60       1
2000-01-08   70       1
2000-01-09   80       1
2000-01-10   90       1
2000-01-11  NaN       1
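
A query expression can also refer to a local Python variable by prefixing it with @, which may be convenient when the value to match is not hard-coded. The snippet below is a small sketch on the same toy data; `flag` and `no_recession` are names made up for this example.

```python
import numpy as np
import pandas as pd

# Same toy data as in the answer
GDP = pd.DataFrame({'GDP': np.arange(10) * 10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
recession = pd.DataFrame({'USRECQ': [0] * 5 + [1] * 5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
combined = GDP.join(recession, how='outer')

# '@flag' refers to the local variable 'flag' inside the query string
flag = 0
no_recession = combined.query('USRECQ == @flag')

# The boolean-mask form with .loc selects the same rows
same = no_recession.equals(combined.loc[combined['USRECQ'] == flag])
print(no_recession)
print(same)
```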