我正在学习使用python和pandas,我想知道如何根据两列中的单元格值组合来自不同行的内容。 在这种情况下,我想合并同一部门和同一ID内的分数。
这是我拥有的DF的简化版本:
id department name scoreA
abc sales eric 2
abc sales jack 3
abc marketing sofia 6
abc marketing anna 7
zzz sales jack 1
zzz sales eric 8
zzz marketing sofia 11
zzz marketing anna 1
这是我想要的DF:
id department totalScoreA
abc sales 5
abc marketing 13
zzz sales 9
zzz marketing 12
我还有一个后续问题。如果有两列计数,并且我想要这些计数的平均值,但是在平均那些值之前,我想将scoreB乘以2。像这样:
id department name scoreA scoreB
abc sales eric 2 10
abc sales jack 3 6
abc marketing sofia 6 8
abc marketing anna 7 10
zzz sales eric 8 10
zzz sales jack 2 10
zzz marketing sofia 11 4
zzz marketing anna 1 10
这是我想要的DF:
id department totalScoreA AverageScore((A+B*2)/2)
abc sales 5 18.5
abc marketing 13 24.5
zzz sales 10 25
zzz marketing 12 20
嘿,非常感谢@jezrael的回答!第一个按预期工作!
但是对于定义第二个问题,我可能有点含糊不清。我想要的是获得每个部门每个组的所有scoreB*2
的{{1}}的“组合”平均值。我举一个带有值的例子来阐明这一点:
从此:
ScoreA
对此:
id department name scoreA scoreB
zzz marketing sofia 5 4
zzz marketing anna 2
是meanAB
(数字3来自值的计数)。那么,即使在先前的解决方案的帮助下,我也无法做到这一点,我将如何计算它:/
(5+2+4*2)/3
答案 0 :(得分:6)
首先是聚合sum
与DataFrame.groupby
:
df1 = df.groupby(['id', 'department'], as_index=False, sort=False)['scoreA'].sum()
print (df1)
id department scoreA
0 abc sales 5
1 abc marketing 13
2 zzz sales 10
3 zzz marketing 12
第二个是第一个多列scoreB
,添加了scoreA
,并将GroupBy.agg
与聚合函数字典一起使用,此处为sum
和mean
:
df2 = (df.assign(scoreB = df['scoreB'] * 2 + df['scoreA'])
.groupby(['id', 'department'], as_index=False, sort=False)
.agg({'scoreA':'sum', 'scoreB':'mean'}))
print (df2)
id department scoreA scoreB
0 abc sales 5 18.5
1 abc marketing 13 24.5
2 zzz sales 10 25.0
3 zzz marketing 12 20.0
编辑:
print (df)
id department name scoreA scoreB
0 abc sales eric 2 10.0
1 abc sales jack 3 6.0
2 abc marketing sofia 6 8.0
3 abc marketing anna 7 10.0
4 abc marketing eric 8 10.0 <-changed data
5 zzz sales jack 2 10.0
6 zzz marketing sofia 5 4.0 <-changed data
7 zzz marketing anna 2 NaN <-changed data
通过DataFrame.count
函数创建新列Count
,以获取带有排除NaN的值的数量,然后聚合sum
并除以mean
:
df2 = (df.assign(scoreB = df['scoreB'].mul(2).add(df['scoreA'], fill_value=0),
Count = df[['scoreA','scoreB']].count(1))
.groupby(['id', 'department'], as_index=False, sort=False)
.sum())
print (df2)
id department scoreA scoreB Count
0 abc sales 5 37.0 4
1 abc marketing 21 77.0 6
2 zzz sales 2 22.0 2
3 zzz marketing 7 15.0 3
df2['scoreB'] /= df2.pop('Count')
print (df2)
id department scoreA scoreB
0 abc sales 5 9.250000
1 abc marketing 21 12.833333
2 zzz sales 2 11.000000
3 zzz marketing 7 5.000000
详细信息:
print (df.assign(scoreB = df['scoreB'].mul(2).add(df['scoreA'], fill_value=0),
Count = df[['scoreA','scoreB']].count(1)))
id department name scoreA scoreB Count
0 abc sales eric 2 22.0 2
1 abc sales jack 3 15.0 2
2 abc marketing sofia 6 22.0 2
3 abc marketing anna 7 27.0 2
4 abc marketing eric 8 28.0 2
5 zzz sales jack 2 22.0 2
6 zzz marketing sofia 5 13.0 2
7 zzz marketing anna 2 2.0 1