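I have the following function, which creates an initial DataFrame, then iterates over a dict of functions and, on each iteration, concatenates the DataFrame that function returns onto the initial one: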
import pandas as pd

def get_variables_daily(start_date='1990-06-08', end_date='2015-05-04', describe=False):
    # Each entry maps a variable name to [use_flag, loader_function].
    variables_daily = {"DP": [1, get_DP_daily], "PE": [1, get_PE_daily], "BM": [1, get_BM_daily], "CAPE": [1, get_CAPE_daily],
                       "PCAprice": [1, get_PCAprice_daily], "BY": [1, get_BY_daily], "DEF": [1, get_DEF_daily],
                       "TERM": [1, get_TERM_daily], "CAY": [1, get_CAY_daily], "SIM": [1, get_SIM_daily], "VRP": [1, get_VRP_daily],
                       "IC": [0, get_IC_daily], "BDI": [1, get_BDI_daily], "NOS": [1, get_NOS_daily], "CPI": [1, get_CPI_daily],
                       "PCR": [1, get_PCR_daily], "MA": [1, get_MA_daily], "PCAtech": [0, get_PCAtech_daily],
                       "OIL": [1, get_OIL_daily], "SI": [1, get_SI_daily]}
    start_date = pd.to_datetime(start_date, yearfirst=True)
    end_date = pd.to_datetime(end_date, yearfirst=True)
    # create initial timeseries
    SPXR_1M = get_SPXR_daily(22, '1990-06-08', '2015-05-04')
    SPXR_3M = get_SPXR_daily(65, '1990-06-08', '2015-05-04')
    SPXR_6M = get_SPXR_daily(130, '1990-06-08', '2015-05-04')
    SPXR_12M = get_SPXR_daily(252, '1990-06-08', '2015-05-04')
    df1 = pd.concat([SPXR_1M, SPXR_3M, SPXR_6M, SPXR_12M], axis=1)
    # iterate over variables
    for key in variables_daily.keys():
        # check if variable should be used
        check = variables_daily[key][0]
        if check == 1:
            # convert_objects was removed in pandas 1.0; on newer versions
            # use .apply(pd.to_numeric, errors='coerce') instead.
            df2 = variables_daily[key][1](start_date, end_date).convert_objects(convert_numeric=True)
            df1 = pd.concat([df1, df2], axis=1)
    # return the accumulated frame (the original returned the undefined name `df`)
    return df1
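As you can see, SPXR_1M, SPXR_3M, SPXR_6M and SPXR_12M form the base of my DataFrame, which means I should never end up with more rows than SPXR_1M has. However, if you look at the summary of the final DataFrame: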
            count       mean       std         min        25%        50%        75%        max
DP         5706.0   0.018063  0.004894    0.008400   0.014900   0.017900   0.020300   0.040000
PE         6497.0  19.750139  4.267477   10.949800  16.581000  18.395300  23.155400  30.720600
BM         6497.0   0.371955  0.088411    0.192378   0.323687   0.369440   0.437982   0.688610
CAPE       6275.0  25.824579  6.981803   11.849780  20.973447  24.878816  27.720910  47.255292
PCAprice   5706.0  -3.125544  3.082865  -17.258065  -4.958795  -2.354091  -0.651261   0.001848
BY         6249.0   0.977558  0.105177    0.566707   0.915942   0.972425   1.042780   1.457659
DEF        6231.0   0.954645  0.413315    0.430000   0.700000   0.870000   1.050000   3.500000
TERM       6485.0   1.865422  1.158198   -0.989000   0.916300   1.994900   2.793000   3.863000
CAY        6275.0   0.000324  0.016228   -0.031944  -0.012895  -0.002148   0.012844   0.031044
SIM        6252.0   0.742821  0.324054    0.007692   0.484615   0.976923   1.000000   1.000000
VRP        6272.0   0.066305  0.038604   -0.141507   0.042648   0.059513   0.082342   0.372942
BDI        6246.0   0.044917  0.324404   -0.900719  -0.133461   0.012865   0.196221   2.320175
NOS        6191.0   0.010533  0.043129   -0.193359  -0.011152   0.010640   0.033015   0.253546
CPI        6275.0   0.023318  0.011918   -0.020422   0.016667   0.024161   0.029222   0.059571
PCR        6275.0  -1.361110  0.363751   -2.260664  -1.609558  -1.412000  -1.140890  -0.333101
MA         6497.0   0.769432  0.421229    0.000000   1.000000   1.000000   1.000000   1.000000
OIL        6252.0   0.012821  0.179430   -1.132002  -0.079546   0.029171   0.119024   0.771293
SI         2226.0   3.689411  0.744130    1.952062   3.182868   3.719568   4.202039   5.760021
SPXR_22D   6253.0   0.007297  0.045672   -0.297937  -0.016283   0.010915   0.033889   0.224057
SPXR_65D   6210.0   0.022014  0.076810   -0.409638  -0.013432   0.028260   0.067303   0.388187
SPXR_130D  6145.0   0.046397  0.113950   -0.474598  -0.003915   0.055771   0.112865   0.541292
SPXR_252D  6023.0   0.091534  0.169579   -0.488228   0.028048   0.110181   0.200745   0.685735
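You can see that the observation counts are inconsistent: columns with fewer rows are fine, but every count should be 6253 or less, and several (PE, BM, MA, TERM, CAPE and others) are higher. Is my concatenation handling the extra rows in the appended DataFrames correctly?

EDIT: After all the concatenations, there appear to be many gaps in my initial columns. Is there a way to make pandas add only those rows from DataFrame B that DataFrame A already has?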
Answer 0 (score: 1)
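I have a few ideas:
- pd.concat mixes the indexes of the data frames (by default it takes their union) and generates NA for the missing values. Try setting the join='inner' parameter.
- If you can reorganize your data creation by column, you could also try merge instead: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html. A left merge is probably what you need.
- Don't switch the NA values off with .dropna(). They may show you which extra rows were inserted.
- Try turning off the second concat call, so you can work out which of the two is causing the trouble.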
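
To illustrate the first two points, here is a minimal sketch on made-up data. The frames `base` and `extra` and their values are hypothetical stand-ins for the SPXR columns and one predictor series, not the asker's real data. It shows how the default outer concat grows the row count, how join='inner' and a left join behave instead, and one way to list the dates that only the appended frame has.

```python
import pandas as pd

# Hypothetical stand-ins: `base` plays the role of the SPXR return columns,
# `extra` the role of one predictor series with a different set of dates.
base = pd.DataFrame(
    {"SPXR_22D": [0.01, 0.02, 0.03]},
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
)
extra = pd.DataFrame(
    {"PE": [18.0, 18.1, 18.2, 18.3]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-05"]),
)

# Default concat along axis=1 is an outer join: the result index is the
# union of both indexes, so dates that only `extra` has become new rows.
outer = pd.concat([base, extra], axis=1)

# join="inner" keeps only the dates present in both frames.
inner = pd.concat([base, extra], axis=1, join="inner")

# A left join keeps every row of `base` and nothing else; dates missing
# from `extra` get NaN in the PE column.
left = base.join(extra, how="left")

# Diagnostic: the dates that `extra` has and `base` does not.
extra_only = extra.index.difference(base.index)

print(len(outer), len(inner), len(left), len(extra_only))
```

With the left join, the base columns never gain rows, which matches the behaviour asked for in the edit to the question. Note that join='inner' applied repeatedly in a loop would also drop base dates that any single predictor is missing, so the left join is likely the safer choice here.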