pd.DataFrame.from_dict() does not give the expected result

Time: 2019-01-21 22:57:51

Tags: python dictionary dataframe word-count

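I am new to Python programming. I want to get the word count of every word in this Wikipedia dataset (people_wiki.csv). I was able to count each word, and the result comes out as a dictionary, but I cannot split the dictionary's key-value pairs into separate columns. I have tried several approaches (from_dict, from_records, to_frame, pivot_table, etc.). Is this possible in Python? Any help would be appreciated.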

Sample dataset:

URI                                           name             text

<http://dbpedia.org/resource/George_Clooney>  George Clooney   'george timothy clooney born may 6 1961 is an american actor writer producer director and activist he has received three golden globe awards for his work as an actor and two academy awards one for acting and the other for producingclooney made his...'

I tried:

clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])

I also tried:

clooney['word_count'].to_frame()

Here is my code:

import pandas as pd
from collections import Counter

people = pd.read_csv("people_wiki.csv")
clooney = people[people['name'] == 'George Clooney']

# Each row's text becomes a Counter mapping word -> number of occurrences
clooney['word_count'] = clooney['text'].apply(lambda x: Counter(x.split(' ')))

clooney_word_count_table = pd.DataFrame.from_dict(clooney['word_count'], orient='index', columns=['word','count'])
clooney_word_count_table

Output:

       word_count
35817   {'george': 1, 'timothy': 1, 'clooney': 9, 'ii': ...

I would like clooney_word_count_table to be an output DataFrame with 2 columns:

word      count
normalize  1
george     3
combat     1
producer   2

1 answer:

Answer 0: (score: 0)

The problem is that clooney is a DataFrame (containing a single row with index 35817), so clooney['word_count'] is a Series containing a single value (your count dictionary) at index 35817.

DataFrame.from_dict then treats that Series as equivalent to {35817: {'george': 1, ...}}, which gives you the confusing result.
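To illustrate the point about the one-element Series, here is a minimal sketch. It builds a stand-in for clooney['word_count'] by hand (the short text is made up for illustration; only the index label 35817 comes from the output in the question) and shows that pulling the single Counter out of the Series first gives from_dict a plain word-to-count mapping to work with.

```python
from collections import Counter
import pandas as pd

# Hypothetical stand-in for clooney['word_count']: a Series with ONE element,
# a Counter, stored under the index label 35817 seen in the question's output.
word_count = pd.Series(
    [Counter("george clooney clooney actor".split())],
    index=[35817],
)

# The Series has a single entry; the words are inside that entry, not in the index.
print(len(word_count))

# Extract the Counter itself before building the table.
counts = word_count.iloc[0]

# Now the keys (words) become the index and the values fill a 'count' column.
table = pd.DataFrame.from_dict(counts, orient='index', columns=['count'])
print(table)
```

This only covers the single-row case; it assumes the selection matched exactly one person.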

To adjust for this, and assuming you want to produce combined word counts over many entries:

from collections import Counter
import pandas as pd

# Load the wikipedia entries and select the ones we care about
people = pd.read_csv("people_wiki.csv")
people_to_process = people[people['name'] == 'George Clooney']

# Compute the counts for these entries
counts = Counter()
people_to_process['text'].apply(lambda text: counts.update(text.split(' ')))

# Transform the counter into a DataFrame
count_table = pd.DataFrame.from_dict(counts, orient='index', columns=['count'])
count_table
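The table above keeps the words in the index and has a single count column. If you want the two explicit columns (word, count) from the question, one possible variation is sketched below. It uses a small hand-made DataFrame in place of people_wiki.csv (the rows and texts are invented for illustration) and a plain loop instead of apply, since the loop is only run for its side effect on the counter.

```python
from collections import Counter
import pandas as pd

# Hypothetical stand-in for people_wiki.csv with the same column names
# as the sample in the question; the texts are made up.
people = pd.DataFrame({
    "name": ["George Clooney", "Someone Else"],
    "text": [
        "george clooney is an actor and producer george",
        "someone else is a writer",
    ],
})
people_to_process = people[people["name"] == "George Clooney"]

# Accumulate word counts over all selected rows.
counts = Counter()
for text in people_to_process["text"]:
    counts.update(text.split())

# Build a two-column table from the (word, count) pairs,
# most frequent words first.
count_table = (
    pd.DataFrame(list(counts.items()), columns=["word", "count"])
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)
print(count_table)
```

Note that text.split() with no argument splits on any run of whitespace, whereas split(' ') in the original code can produce empty strings when the text contains consecutive spaces; which one is appropriate depends on your data.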