Group by and count distinct words in a Pandas DataFrame

Time: 2016-05-29 10:15:09

Tags: python pandas dataframe group-by distinct-values

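I want to count word occurrences, grouped by year and name, in a DataFrame imported from Excel. The result will also be exported to Excel.

Here is the sample code: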

import pandas as pd

source = pd.DataFrame({'Name': ['John', 'Mike', 'John', 'John'],
                       'Year': ['1999', '2000', '2000', '2000'],
                       'Message': ['I Love You', 'Will Remember You', 'Love', 'I Love You']})

The result I want in the DataFrame is shown below. Any ideas?

Year  Name  Message   Count
1999  John  I         1
1999  John  Love      1
1999  John  You       1
2000  Mike  Will      1
2000  Mike  Remember  1
2000  Mike  You       1
2000  John  Love      2
2000  John  I         1
2000  John  You       1

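1 Answer:

Answer 0 (score: 2)

I think you can first split the Message column, build a Series from the words with stack, and join it back to the original source. Finally, use groupby with size.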

# split the Message column into one column per word, then stack into a Series
s = source.Message.str.split(expand=True).stack()
# drop the inner index level so only the original row label remains
s.index = s.index.droplevel(-1)
s.name = 'Message'
print(s)
0           I
0        Love
0         You
1        Will
1    Remember
1         You
2        Love
3           I
3        Love
3         You
Name: Message, dtype: object
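
It may help to see why the last index level is dropped. The following is a minimal sketch, not part of the original answer, that rebuilds the Message column on its own and inspects the index before and after droplevel. The names messages, words, stacked and flat are illustrative, and the explicit dropna() is an added safeguard, on the assumption that some pandas versions may keep the padding values when stacking.

```python
import pandas as pd

# Same Message values as in the question's sample data.
messages = pd.Series(['I Love You', 'Will Remember You', 'Love', 'I Love You'],
                     name='Message')

# expand=True yields one column per word; shorter messages are padded with missing values.
words = messages.str.split(expand=True)

# stack() moves the columns into an inner index level: (original row, word position).
# dropna() discards any padding that survives the stack.
stacked = words.stack().dropna()
print(stacked.index.nlevels)  # 2

# Dropping the last level keeps only the original row label, repeated once per word,
# which is what lets the Series be joined back onto source by index.
flat = stacked.copy()
flat.index = flat.index.droplevel(-1)
print(flat.index.tolist())
```

Because every word keeps the label of the row it came from, a later index join can attach the matching Year and Name to each word.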

# remove the old Message column
source = source.drop(['Message'], axis=1)
# join the Series s to source on the index
df = source.join(s)

# count rows per (Year, Name, Message) group
print(df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
   Year  Name   Message  count
0  1999  John         I      1
1  1999  John      Love      1
2  1999  John       You      1
3  2000  John         I      1
4  2000  John      Love      2
5  2000  John       You      1
6  2000  Mike  Remember      1
7  2000  Mike      Will      1
8  2000  Mike       You      1
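
On pandas 0.25 or later, the same split-then-count steps can probably be written more compactly with DataFrame.explode. This is a swapped-in alternative rather than the method used in the answer above. The sketch rebuilds the question's sample data so that it runs on its own, and the names words and counts are illustrative.

```python
import pandas as pd

source = pd.DataFrame({
    'Name': ['John', 'Mike', 'John', 'John'],
    'Year': ['1999', '2000', '2000', '2000'],
    'Message': ['I Love You', 'Will Remember You', 'Love', 'I Love You'],
})

# Turn each message into a list of words, then give every word its own row.
# explode() repeats Name and Year for each word taken from the same message.
words = source.assign(Message=source['Message'].str.split()).explode('Message')

# Count rows per (Year, Name, word) and flatten the group keys back into columns.
counts = (words.groupby(['Year', 'Name', 'Message'])
               .size()
               .reset_index(name='count'))
print(counts)
```

To cover the Excel export mentioned in the question, counts.to_excel with a file name of your choosing should work, provided an Excel writer engine such as openpyxl is installed.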