How to count words in a corpus document

Date: 2011-11-15 15:59:02

Tags: python nltk

I would like to know the best way to count words in a document. If I have my own corpus set up as "corp.txt", and I want to know how frequently the words "students, trust, ayre" occur in the "corp.txt" file, what could I use?

Would it be something like the following:

>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# HOW WOULD I CALCULATE HOW FREQUENTLY THE WORDS
# "students, trust, ayre" occur in full?

Thanks, Ray

4 answers:

Answer 0 (score: 4)

I suggest looking into collections.Counter. Especially for large amounts of text, this does the trick and is limited only by the available memory. It counted 3 billion tokens in a day and a half on a machine with 12 GB of RAM. Pseudocode (the variable words is in practice a reference to a file or similar):

from collections import Counter

my_counter = Counter()
for word in words:
    # Count one token at a time; note that my_counter.update(word)
    # would count the word's individual characters instead.
    my_counter[word] += 1

When done, the words end up in the dictionary my_counter, which can then be written to disk or stored elsewhere (e.g. in sqlite).
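As a minimal sketch of that "store elsewhere" step using the standard library's sqlite3 module (the table name and schema here are assumptions for illustration, not from the answer):

```python
import sqlite3
from collections import Counter

# Toy counts standing in for a real corpus run.
my_counter = Counter("the cat sat on the mat the end".split())

# Persist the counts in a simple two-column table.
conn = sqlite3.connect(":memory:")  # use a file path for on-disk storage
conn.execute("CREATE TABLE word_counts (word TEXT PRIMARY KEY, count INTEGER)")
conn.executemany(
    "INSERT INTO word_counts (word, count) VALUES (?, ?)",
    my_counter.items(),
)
conn.commit()

# Read one count back.
(count,) = conn.execute(
    "SELECT count FROM word_counts WHERE word = ?", ("the",)
).fetchone()
print(count)  # → 3
```

Once the counts are in a table like this, they can be queried without reloading the whole corpus into memory.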

Answer 1 (score: 3)

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just increment its value by one:

from collections import defaultdict

total = 0
count = defaultdict(int)  # missing words default to 0
for word in words:
    total += 1
    count[word] += 1

# Now you can determine the frequency by dividing each count by the total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))

Answer 2 (score: 2)

You are almost there! You can index the FreqDist with the word you are interested in. Try the following:

print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])

This will give you the count, or number of occurrences, of each word. You said "frequency" — frequency is different from the number of occurrences — and it can be obtained like this:

print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
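If NLTK is not at hand, the same count/frequency distinction can be reproduced with the standard library's collections.Counter — a sketch using a made-up token list, not the asker's corpus:

```python
from collections import Counter

# Toy token list standing in for the tokenized corpus.
tokens = ["students", "trust", "students", "ayre", "full", "students"]

fdist = Counter(tokens)

# Count (number of occurrences), like fdist['students'] on a FreqDist.
print(fdist["students"])               # → 3

# Relative frequency, like fdist.freq('students') on a FreqDist.
print(fdist["students"] / len(tokens)) # → 0.5
```

Counter has no freq() method, so the relative frequency is computed by dividing the count by the total number of tokens.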

Answer 3 (score: 0)

You can read the file, tokenize it, and put the individual tokens into an NLTK FreqDist object — see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html

from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # fdist.inc(word) in older NLTK versions

print("'blah' occurred", fdist['blah'], "times")

[OUT]:

'blah' occurred 3 times

Alternatively, you can get the same counts with the native Counter object from collections — see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to lowercase your tokens:

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")