I have run into the following problem. My input DataFrame, df_input, contains a single column named Site_Sector. Site_Sector has the following structure:
Site_Sector
--------------
DEP_1234
TRE_5421
YUT_0901
IOP_ABC3
POS_3456
MEC_2341
XAZ_4532
QPI_9012
KPI_1200
LPO_1300
KIN_9012
SVP_0001
....
JOP_1289
I have three DataFrames named df_cr, df_gt and df_ba, held in the list list_of_dfs = [df_cr, df_gt, df_ba]. They all share the following structure (I will only show two of them):
#let's consider some data of df_cr as example
| Date | Site | Sector | KPI_1 | QA_value | Active |
| --------- |---------- |----------|----------|----------| ------ |
09/12/2015 CR_XAZ XAZ_4532 50.0 100.0 Y
09/12/2015 CR_PET PET_2312 50.0 100.0 Y
09/13/2015 CR_XAZ XAZ_4532 50.0 100.0 Y
09/13/2015 CR_PET PET_2312 50.0 100.0 Y
09/14/2015 CR_XAZ XAZ_4532 30.0 60.0 Y
09/14/2015 CR_PET PET_2312 25.0 50.0 N
09/15/2015 CR_XAZ XAZ_4532 25.0 50.0 N
09/15/2015 CR_PET PET_2312 40.0 80.0 Y
09/16/2015 CR_XAZ XAZ_4532 35.0 70.0 Y
09/16/2015 CR_PET PET_2312 45.0 90.0 Y
09/17/2015 CR_XAZ XAZ_4532 15.0 30.0 N
09/17/2015 CR_PET PET_2312 50.0 100.0 Y
.....
09/25/2015 CR_XAZ PET_4532 12.0 24.0 N
09/25/2015 CR_PET XAZ_2312 12.0 24.0 N
#let's consider some data of df_ba as example
| Date | Site | Sector | KPI_1 | QA_value | Active |
| --------- |--------- |----------| ---------|----------| ------ |
09/12/2015 CR_DEP DEP_1234 35.0 70.0 Y
09/12/2015 CR_XZT XZT_1212 50.0 100.0 Y
09/13/2015 CR_DEP DEP_1234 15.0 30.0 N
09/13/2015 CR_XZT XZT_1212 50.0 100.0 Y
09/14/2015 CR_DEP DEP_1234 35.0 70.0 Y
09/14/2015 CR_XZT XZT_1212 25.0 50.0 Y
09/15/2015 CR_DEP DEP_1234 25.0 50.0 Y
09/15/2015 CR_XZT XZT_1212 40.0 80.0 Y
09/16/2015 CR_DEP DEP_1234 15.0 30.0 N
09/16/2015 CR_XZT XZT_1212 45.0 90.0 Y
09/17/2015 CR_DEP DEP_1234 50.0 100.0 Y
09/17/2015 CR_XZT XZT_1212 50.0 100.0 Y
.....
09/25/2015 CR_DEP DEP_1234 10.0 20.0 N
09/25/2015 CR_XZT XZT_1212 50.0 100.0 Y
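My goal is to compare each value in the Site_Sector column of df_input with the Sector column of every DataFrame in the list. Wherever a Site_Sector value matches a Sector value, the Date, KPI_1, QA_value and Active columns of the matching rows should be added to df_input.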
#expected output
Site_Sector| Date | KPI_1| QA_value | Active
----------------------------------------------------
DEP_1234 09/12/2015 35.0 70.0 Y
DEP_1234 09/13/2015 15.0 30.0 N
DEP_1234 09/14/2015 35.0 70.0 Y
DEP_1234 09/15/2015 25.0 50.0 Y
....
XAZ_4532 09/12/2015 50.0 100.0 Y
XAZ_4532 09/13/2015 50.0 100.0 Y
XAZ_4532 09/14/2015 30.0 60.0 Y
XAZ_4532 09/15/2015 25.0 50.0 N
....
If anything is unclear or more detail is needed, please leave a comment on this post and I will gladly explain further.
Answer 0 (score: 2)
I would do this with a list comprehension plus pd.Series.isin:
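import pandas as pd
data = df_input.Site_Sector  # the Sector values we want to keep
# Keep only the rows of each DataFrame whose Sector appears in df_input
filtered_dfs = [x[x.Sector.isin(data)] for x in list_of_dfs]
# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
output = pd.concat(filtered_dfs).drop(columns='Site')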
With your input, this gives:
print(output.sort_values('Sector'))
Date Sector KPI_1 QA_value Active
0 09/12/2015 DEP_1234 35.0 70.0 Y
2 09/13/2015 DEP_1234 15.0 30.0 N
4 09/14/2015 DEP_1234 35.0 70.0 Y
6 09/15/2015 DEP_1234 25.0 50.0 Y
8 09/16/2015 DEP_1234 15.0 30.0 N
10 09/17/2015 DEP_1234 50.0 100.0 Y
12 09/25/2015 DEP_1234 10.0 20.0 N
0 09/12/2015 XAZ_4532 50.0 100.0 Y
2 09/13/2015 XAZ_4532 50.0 100.0 Y
4 09/14/2015 XAZ_4532 30.0 60.0 Y
6 09/15/2015 XAZ_4532 25.0 50.0 N
8 09/16/2015 XAZ_4532 35.0 70.0 Y
10 09/17/2015 XAZ_4532 15.0 30.0 N
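For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same isin-based filtering. The miniature frames below are made up from a few rows of the question's sample data (they are not the real df_cr and df_ba), and the rename, column reordering and sort at the end are my own additions to get closer to the layout of the expected output; they are not part of the approach above.

```python
import pandas as pd

# Hypothetical miniature versions of the frames in the question (illustration only).
df_input = pd.DataFrame({"Site_Sector": ["DEP_1234", "XAZ_4532", "KPI_1200"]})

df_cr = pd.DataFrame({
    "Date": ["09/12/2015", "09/12/2015", "09/13/2015", "09/13/2015"],
    "Site": ["CR_XAZ", "CR_PET", "CR_XAZ", "CR_PET"],
    "Sector": ["XAZ_4532", "PET_2312", "XAZ_4532", "PET_2312"],
    "KPI_1": [50.0, 50.0, 50.0, 50.0],
    "QA_value": [100.0, 100.0, 100.0, 100.0],
    "Active": ["Y", "Y", "Y", "Y"],
})
df_ba = pd.DataFrame({
    "Date": ["09/12/2015", "09/12/2015", "09/13/2015", "09/13/2015"],
    "Site": ["CR_DEP", "CR_XZT", "CR_DEP", "CR_XZT"],
    "Sector": ["DEP_1234", "XZT_1212", "DEP_1234", "XZT_1212"],
    "KPI_1": [35.0, 50.0, 15.0, 50.0],
    "QA_value": [70.0, 100.0, 30.0, 100.0],
    "Active": ["Y", "Y", "N", "Y"],
})
list_of_dfs = [df_cr, df_ba]

# Keep only the rows whose Sector appears in df_input.Site_Sector.
wanted = df_input["Site_Sector"]
filtered_dfs = [df[df["Sector"].isin(wanted)] for df in list_of_dfs]

output = (
    pd.concat(filtered_dfs, ignore_index=True)
    .drop(columns="Site")
    .rename(columns={"Sector": "Site_Sector"})
    # Date is still a string here; that sorts correctly for this sample only.
    .sort_values(["Site_Sector", "Date"])
    .reset_index(drop=True)
)
# Reorder the columns to follow the layout of the expected output.
output = output[["Site_Sector", "Date", "KPI_1", "QA_value", "Active"]]
print(output)
```

Note that a Site_Sector value with no match in any frame (KPI_1200 in this sketch) simply does not show up in the result. If such values should be kept with empty columns, a left merge of df_input against the concatenated frames would be one way to do it. For real data you would probably also want to convert Date with pd.to_datetime before sorting.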