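Based on pre-existing rows

Date: 2018-03-27 20:44:58

Tags: python pandas dataframe merge concat

I have the following function, which creates an initial DataFrame, iterates over a dict of functions, and on each iteration concatenates the resulting dataframe onto the initial one: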

import pandas as pd

def get_variables_daily(start_date='1990-06-08', end_date='2015-05-04', describe=False):
    # name -> [use flag (1 = include, 0 = skip), loader function]
    variables_daily = {"DP": [1, get_DP_daily], "PE": [1, get_PE_daily], "BM": [1, get_BM_daily], "CAPE": [1, get_CAPE_daily],
             "PCAprice": [1, get_PCAprice_daily], "BY": [1, get_BY_daily], "DEF": [1, get_DEF_daily],
             "TERM": [1, get_TERM_daily], "CAY": [1, get_CAY_daily], "SIM": [1, get_SIM_daily], "VRP": [1, get_VRP_daily],
             "IC": [0, get_IC_daily], "BDI": [1, get_BDI_daily], "NOS": [1, get_NOS_daily], "CPI": [1, get_CPI_daily],
             "PCR": [1, get_PCR_daily], "MA": [1, get_MA_daily],  "PCAtech": [0, get_PCAtech_daily],
             "OIL": [1, get_OIL_daily], "SI": [1, get_SI_daily]}

    start_date = pd.to_datetime(start_date, yearfirst=True)
    end_date = pd.to_datetime(end_date, yearfirst=True)
    # create the initial time series (the base columns)
    SPXR_1M = get_SPXR_daily(22, '1990-06-08', '2015-05-04')
    SPXR_3M = get_SPXR_daily(65, '1990-06-08', '2015-05-04')
    SPXR_6M = get_SPXR_daily(130, '1990-06-08', '2015-05-04')
    SPXR_12M = get_SPXR_daily(252, '1990-06-08', '2015-05-04')

    df1 = pd.concat([SPXR_1M, SPXR_3M, SPXR_6M, SPXR_12M], axis=1)
    # iterate over the variables
    for key, (check, loader) in variables_daily.items():
        # only use variables whose flag is set to 1
        if check == 1:
            # convert_objects() was removed in pandas 1.0; pd.to_numeric with
            # errors='coerce' likewise turns unconvertible values into NaN
            df2 = loader(start_date, end_date).apply(pd.to_numeric, errors='coerce')
            # default join='outer': rows that exist only in df2 are added too
            df1 = pd.concat([df1, df2], axis=1)
    return df1  # the original returned `df`, which is not defined

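As you can see, SPXR_1M, SPXR_3M, SPXR_6M and SPXR_12M are the basis of my DataFrame, which means I should never end up with more rows than SPXR_1M has. However, if you look at the summary of the final DF: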

            count       mean       std        min        25%        50%  \
DP         5706.0   0.018063  0.004894   0.008400   0.014900   0.017900   
PE         6497.0  19.750139  4.267477  10.949800  16.581000  18.395300   
BM         6497.0   0.371955  0.088411   0.192378   0.323687   0.369440   
CAPE       6275.0  25.824579  6.981803  11.849780  20.973447  24.878816   
PCAprice   5706.0  -3.125544  3.082865 -17.258065  -4.958795  -2.354091   
BY         6249.0   0.977558  0.105177   0.566707   0.915942   0.972425   
DEF        6231.0   0.954645  0.413315   0.430000   0.700000   0.870000   
TERM       6485.0   1.865422  1.158198  -0.989000   0.916300   1.994900   
CAY        6275.0   0.000324  0.016228  -0.031944  -0.012895  -0.002148   
SIM        6252.0   0.742821  0.324054   0.007692   0.484615   0.976923   
VRP        6272.0   0.066305  0.038604  -0.141507   0.042648   0.059513   
BDI        6246.0   0.044917  0.324404  -0.900719  -0.133461   0.012865   
NOS        6191.0   0.010533  0.043129  -0.193359  -0.011152   0.010640   
CPI        6275.0   0.023318  0.011918  -0.020422   0.016667   0.024161   
PCR        6275.0  -1.361110  0.363751  -2.260664  -1.609558  -1.412000   
MA         6497.0   0.769432  0.421229   0.000000   1.000000   1.000000   
OIL        6252.0   0.012821  0.179430  -1.132002  -0.079546   0.029171   
SI         2226.0   3.689411  0.744130   1.952062   3.182868   3.719568   
SPXR_22D   6253.0   0.007297  0.045672  -0.297937  -0.016283   0.010915   
SPXR_65D   6210.0   0.022014  0.076810  -0.409638  -0.013432   0.028260   
SPXR_130D  6145.0   0.046397  0.113950  -0.474598  -0.003915   0.055771   
SPXR_252D  6023.0   0.091534  0.169579  -0.488228   0.028048   0.110181   

                 75%        max  
DP          0.020300   0.040000  
PE         23.155400  30.720600  
BM          0.437982   0.688610  
CAPE       27.720910  47.255292  
PCAprice   -0.651261   0.001848  
BY          1.042780   1.457659  
DEF         1.050000   3.500000  
TERM        2.793000   3.863000  
CAY         0.012844   0.031044  
SIM         1.000000   1.000000  
VRP         0.082342   0.372942  
BDI         0.196221   2.320175  
NOS         0.033015   0.253546  
CPI         0.029222   0.059571  
PCR        -1.140890  -0.333101  
MA          1.000000   1.000000  
OIL         0.119024   0.771293  
SI          4.202039   5.760021  
SPXR_22D    0.033889   0.224057  
SPXR_65D    0.067303   0.388187  
SPXR_130D   0.112865   0.541292  
SPXR_252D   0.200745   0.685735  

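You can see that the observation counts are not consistent: every column should have 6253 rows, or fewer if its source has fewer rows. Is my concatenation handling the extra rows in the appended dataframes correctly? Edit: after all the concatenations there appear to be many gaps in my initial columns. Is there a way to have pandas add only those rows of dataframe B that dataframe A already has?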

1 answer:

Answer 0 (score: 1)

I have a few ideas:

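  • pd.concat combines the indexes of the dataframes (by default join='outer', the union of all index labels) and fills the missing values with NA; try setting the join='inner' parameter

  • If you can reorganize your data creation by column, you could also try merge instead (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html); you will probably need a left merge

  • Do not suppress the NA values with .dropna(); they can show you which extra rows were inserted

  • Try disabling the second concat call, so that you can tell which of the two is causing the trouble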
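
The difference between the join types mentioned above can be seen on a toy example. This is only a minimal sketch: `base` and `extra` are hypothetical stand-ins for the SPXR columns and for one of the additional variables, and the dates and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the base return series.
base = pd.DataFrame(
    {"SPXR_22D": [0.01, 0.02, 0.03]},
    index=pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-07"]),
)
# Hypothetical extra variable: starts one day earlier, ends one day earlier.
extra = pd.DataFrame(
    {"PE": [18.0, 18.5, 19.0]},
    index=pd.to_datetime(["2015-01-04", "2015-01-05", "2015-01-06"]),
)

# Default outer join: union of both indexes, NaN where a frame has no row.
outer = pd.concat([base, extra], axis=1)

# Inner join: only the dates that are present in both frames.
inner = pd.concat([base, extra], axis=1, join="inner")

# Left join: every row of base, plus only those rows of extra that base has.
left = base.join(extra, how="left")

print(len(outer), len(inner), len(left))  # 4 2 3
```

The outer join adds a row for 2015-01-04, which `base` does not have, and that is how a column such as PE can end up with a higher count than the SPXR columns. The inner join also discards base rows that `extra` lacks, so if the goal is to keep exactly the rows of the base frame, something like `df1 = df1.join(df2, how="left")` inside the loop is probably closer to what the question asks for.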