Question

I want to count how often words are used in some documents, using the following:

Counter(word.rstrip(punctuation) for word in words).most_common(10)

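I can't simply append .subtract(exclusion_list) to this command, where exclusion_list is a list of words I don't want. How can I get the top ten words without the ones in the exclusion list?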

Answer 1

To get the top 10 words that are not in the exclusion list, this should work:

Counter(w for w in (word.rstrip(punctuation) for word in words) if w not in exclusion_list).most_common(10)
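As a self-contained sketch of this filter-before-counting approach, one might strip punctuation first and then test each word against a set, so that a token like "the," is compared as "the". The helper name `top_words` and the sample text are made up for illustration, and skipping empty strings is an extra assumption not present in the one-liner.

```python
from collections import Counter
from string import punctuation


def top_words(words, exclusion_list, n=10):
    """Count words after stripping trailing punctuation, skipping excluded ones."""
    excluded = set(exclusion_list)  # set lookups are O(1) on average
    stripped = (word.rstrip(punctuation) for word in words)
    # Assumption: tokens that were pure punctuation (now empty) are dropped too.
    return Counter(w for w in stripped if w and w not in excluded).most_common(n)


# Hypothetical sample input, not from the original question.
sample = "the cat, the dog and the bird. a cat and a dog".split()
print(top_words(sample, ["the", "a", "and"], n=2))
```

Converting the exclusion list to a set is optional, but it avoids a linear scan of the list for every word in a large document.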

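Otherwise, if for some reason you want to take the top 10 words first and then drop the ones in the exclusion list (which can leave you with fewer than 10 words), this should work: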

[w for w in Counter(word.rstrip(punctuation) for word in words).most_common(10) if w[0] not in exclusion_list]
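The two orders of operation can give different results. A small sketch with made-up data, using a top 3 instead of a top 10 to keep it short, suggests how filtering after `most_common` may return fewer entries than requested:

```python
from collections import Counter

# Hypothetical data for illustration only.
words = "a a a b b b c c d d e".split()
exclusion_list = ["a", "b"]

# Filter first, then take the top 3: up to 3 non-excluded words come back.
filter_first = Counter(w for w in words if w not in exclusion_list).most_common(3)

# Take the top 3 first, then filter: excluded words use up slots in the top 3.
top_first = [wc for wc in Counter(words).most_common(3) if wc[0] not in exclusion_list]

print(filter_first)  # three (word, count) pairs
print(top_first)     # a single pair survives the filter
```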

Answer 2

You can use a list comprehension:

>>> from collections import Counter
>>> words = ('proper prefix '+'1 2 3 4 5 6 7 8 9 A '*10+' proper suffix').split()
>>> exclusion_list = '1 3 5 7 9'.split()
>>> [w for w, c in Counter(words).most_common(10) if w not in exclusion_list]
['2', '4', '6', '8', 'A']

If you want tuples pairing each word with its count:

>>> [(w, c) for w, c in Counter(words).most_common(10) if w not in exclusion_list]
[('2', 10), ('4', 10), ('6', 10), ('8', 10), ('A', 10)]

Another way, using filter (wrapped in list(), since filter returns a lazy iterator in Python 3):

>>> list(filter(lambda wc: wc[0] not in exclusion_list, Counter(words).most_common(10)))
[('2', 10), ('4', 10), ('6', 10), ('8', 10), ('A', 10)]
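
If the counts already exist as a Counter, a related option is to drop the excluded keys before calling `most_common`, so the result is not cut short by excluded words. This is a sketch rather than part of the answers above, and the helper name `without_words` is hypothetical. For context, `Counter.subtract` lowers counts by the occurrences found in its argument and returns `None`, which is why it cannot be chained in the question's one-liner.

```python
from collections import Counter


def without_words(counts, exclusion_list):
    """Return a new Counter with the excluded words removed (input is untouched)."""
    excluded = set(exclusion_list)
    return Counter({w: c for w, c in counts.items() if w not in excluded})


words = ('proper prefix ' + '1 2 3 4 5 6 7 8 9 A ' * 10 + ' proper suffix').split()
counts = Counter(words)

# Only 8 distinct words remain after exclusion, so fewer than 10 pairs come back.
top = without_words(counts, '1 3 5 7 9'.split()).most_common(10)
print(top)
```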
