Keep the most frequent values in a Python list

Time: 2018-05-09 05:44:07

Tags: python pandas

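I am building a bag of words from a text corpus, and I am trying to limit my vocabulary size because the program freezes when I try to convert my list to a pandas DataFrame. I use Counter to count the occurrences of each word: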

from collections import Counter
import pandas as pd
bow = []
# corpus is list of text samples where each text sample is a list of words with variable length
for tokenized_text in corpus:
    clean_text = [tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords]
    bow.append(Counter(clean_text))
# Program freezes here
df_bows = pd.DataFrame.from_dict(bow)

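My input is a list of length num_samples, where each text sample is itself a list of tokens. For my output, I want a pandas DataFrame with shape (num_samples, 10000), where 10000 is my vocabulary size. Previously, the vocabulary size of df_bows (df_bows.shape[1]) grew very large (more than 50,000). How can I select the 10,000 most frequent words from my bow list of Counter objects and put them in a DataFrame while keeping the number of text samples unchanged?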

3 answers:

Answer 0 (score: 3)

To find the top 10,000 words overall, the simplest approach is to update a global Counter:

from collections import Counter
global_counter = Counter() # <- create a counter
for tokenized_text in corpus:
    clean_text = [tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords]
    global_counter.update(clean_text) # <- update it

At this point, you can use

import pandas as pd
df = pd.DataFrame(global_counter.most_common(10000))

If you want the word counts for each individual sample, add the following code right after the previous snippet.

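# Vocabulary: the 10000 most frequent words across the whole corpus
most_common = {word for word, _ in global_counter.most_common(10000)}
occurrences = []
for tokenized_text in corpus:
    # Count this sample only; Counter is already imported above, so no `collections.` prefix
    counts = Counter(tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords)
    # Words absent from the sample get 0, so every row has the same keys
    occurrences.append({word: counts.get(word, 0) for word in most_common})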

Now simply use

pd.DataFrame(occurrences)
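
For reference, the two passes above can be put together into one small runnable sketch. The toy corpus, the punctuation and stopwords sets, the clean helper, and vocab_size = 3 are all assumptions made for illustration; none of them come from the question. Note one deliberate difference: this sketch lowercases tokens before the stopword check, whereas the snippets above check the original casing.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy inputs standing in for the question's corpus and filters.
corpus = [
    ["The", "cat", "sat", ",", "the", "cat", "slept"],
    ["A", "dog", "sat", "."],
    ["The", "dog", "and", "the", "cat", "played"],
]
punctuation = {",", "."}
stopwords = {"the", "a", "and"}
vocab_size = 3  # the question uses 10000


def clean(tokenized_text):
    """Lowercase tokens, then drop punctuation and stopwords."""
    lowered = (tok.lower() for tok in tokenized_text)
    return [tok for tok in lowered if tok not in punctuation and tok not in stopwords]


# Pass 1: count every word across the whole corpus.
global_counter = Counter()
for tokenized_text in corpus:
    global_counter.update(clean(tokenized_text))

# Keep the most frequent words, in descending order of frequency.
vocab = [word for word, _ in global_counter.most_common(vocab_size)]

# Pass 2: count each sample, keeping only words in the vocabulary.
rows = []
for tokenized_text in corpus:
    counts = Counter(clean(tokenized_text))
    rows.append([counts[word] for word in vocab])  # missing words count as 0

df_bows = pd.DataFrame(rows, columns=vocab)
print(df_bows.shape)
```

Building each row as a list in vocabulary order, rather than as a dict keyed by a set, keeps the DataFrame columns sorted from most to least frequent word.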

Answer 1 (score: 0)

Counter.most_common(n) returns the n most common elements.

See: https://docs.python.org/3/library/collections.html#collections.Counter.most_common

from collections import Counter

myStr = "It was a very, very good presentation, was it not?"
C = Counter(myStr.split())
C.most_common(2)

# [('was', 2), ('It', 1)]

Answer 2 (score: 0)

Using Counter's most_common helper method, you can get the most frequently occurring words:

from collections import Counter
clean_text = [tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords]
counter = Counter(clean_text)
counter.most_common(10000)
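
The snippet above ranks the words of a single text. If the per-sample Counter objects already exist (as in the question's bow list), one possible way to get a corpus-wide ranking is to merge them and then trim each sample to the shared vocabulary. This is only a sketch: the sample counts, vocab_size = 2, and the names total, top_words, and trimmed are made up for illustration.

```python
from collections import Counter

# Hypothetical per-sample counters, shaped like the question's `bow` list.
bow = [
    Counter({"cat": 2, "sat": 1, "slept": 1}),
    Counter({"dog": 1, "sat": 1}),
    Counter({"dog": 1, "cat": 1, "played": 1}),
]
vocab_size = 2  # the question uses 10000

# Merge the per-sample counts into one corpus-wide counter.
total = Counter()
for counts in bow:
    total.update(counts)  # update() adds counts rather than replacing them

top_words = {word for word, _ in total.most_common(vocab_size)}

# Trim each sample to the shared vocabulary; the number of samples is unchanged.
trimmed = [{w: c for w, c in counts.items() if w in top_words} for counts in bow]
print(total.most_common(vocab_size))
```

This avoids tokenizing and cleaning the corpus a second time, at the cost of keeping all per-sample counters in memory.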