Question

Can you help me understand how to compute the frequency distribution of a "group of words"?

In other words, I have a text file; a snapshot of it is given here.

Here is my code for finding the 50 most common words in the text file:

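import nltk
from nltk import FreqDist

# Read the whole file, then split it on any whitespace
with open('myfile.txt') as f:
    text = f.read()
text1 = text.split()
keywords = nltk.Text(text1)
fdist1 = FreqDist(keywords)
print(fdist1.most_common(50))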

In the results, as you can see in the linked screenshot, every word is counted separately.

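This works well, but I am trying to find the frequency distribution of each line in the text file. For example, the first line contains the term "conceptual change". The program counts "conceptual" and "change" as different keywords. However, I need the frequency distribution of the term "conceptual change" as a whole.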

Answer 1

You are splitting the text on any whitespace. See the docs: this is the default behavior of split() when you do not supply a separator.

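If you print the value of text1 in your example program, you will see this. It is just a list of words, not lines, so the damage is already done by the time the data is passed to FreqDist.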
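To make this concrete, here is a minimal sketch using an in-memory string instead of the file. The sample text and the helper name split_on_whitespace are assumptions made for illustration; they are not taken from the asker's data.

```python
# Hypothetical sample standing in for the asker's file: one term per line.
sample = "conceptual change\ncoherence\nnaive physics\n"

def split_on_whitespace(text):
    # str.split() with no separator splits on runs of any whitespace,
    # so spaces and newlines both act as delimiters.
    return text.split()

print(split_on_whitespace(sample))
# ['conceptual', 'change', 'coherence', 'naive', 'physics']
```

The two-word terms are already broken apart in this list, before any counting happens.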

To fix it, just replace that call with text.split("\n"):

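import nltk
from nltk import FreqDist

# The 'rU' mode was removed in Python 3.11; plain text mode already handles universal newlines
with open('myfile.txt') as f:
    text = f.read()
# Split on newlines so each line stays together as one term
text1 = text.split("\n")
keywords = nltk.Text(text1)
print(type(keywords))
fdist1 = FreqDist(keywords)
print(fdist1.most_common(50))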

This gives output like the following:

[('conceptual change', 1), ('coherence', 1), ('cost-benefit tradeoffs', 1), ('interactive behavior', 1), ('naive physics', 1), ('rationality', 1), ('suboptimal performance', 1)]
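One caveat: splitting on "\n" leaves an empty string in the list when the file ends with a newline or contains blank lines, and that empty string would then be counted as a term. The following is a possible variant, not part of the original answer, that skips blank lines and uses collections.Counter from the standard library (which FreqDist subclasses in NLTK 3). The function name phrase_frequencies and the sample text are assumptions for illustration.

```python
from collections import Counter

def phrase_frequencies(text):
    """Count each non-empty line of text as one term."""
    # splitlines() handles both \n and \r\n; strip() drops stray spaces.
    terms = [line.strip() for line in text.splitlines()]
    # Skip blank lines so the empty string is never counted.
    return Counter(term for term in terms if term)

# Hypothetical sample with a repeated term and a blank line.
sample = "conceptual change\ncoherence\nconceptual change\n\nnaive physics\n"
freq = phrase_frequencies(sample)
print(freq.most_common(2))
# [('conceptual change', 2), ('coherence', 1)]
```

To run it on the real file, read the file contents into a string first and pass that string to phrase_frequencies.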
