How do I find the most frequently used words, in alphabetical order?

Time: 2013-11-19 21:31:06

Tags: python word frequency numerical alphabetical

In this separate program, I'm trying to find the most frequently used words in a text file, listed in alphabetical order.

For example, the word "that" is the most frequently used word in the text file, so it should be printed first: "that #"

It needs to follow the format of the program and the output below:

d = dict()

def counter_one():
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        word = line.strip().lower()
        d = counter_two(word, d)
    return d

def counter_two(word, d):
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        if word not in d:
            d[word] = 1
        else:
            d[word] + 1
    return d

def diction(d):
    for key, val in d.iteritems():
        print key, val

counter_one()
diction(d)

It should produce something like this when run in the shell:

>>>
Words in text: ###
Frequent Words: ###
that 11
the 11
we 10
which 10
>>>

3 Answers:

Answer 0 (score: 3)

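A simple way to get frequency counts is to use the Counter class from the collections module in the standard library. You can pass it a list of words, and it counts them all and maps each word to its frequency.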

from collections import Counter
frequencies = Counter()
with open('gg.txt') as f:
  for line in f:
    frequencies.update(line.lower().split())

I use the lower() function to avoid counting "the" and "The" separately.
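As a quick illustration of what lower() changes, here is a small sketch that uses a made-up in-memory list of lines in place of gg.txt (the sample text is an assumption, not taken from the question's file):

```python
from collections import Counter

# Hypothetical sample lines standing in for the contents of gg.txt.
lines = ["The cat saw the dog", "the dog saw The cat"]

case_sensitive = Counter()
case_insensitive = Counter()
for line in lines:
    case_sensitive.update(line.split())
    case_insensitive.update(line.lower().split())

# Without lower(), "the" and "The" are two separate keys.
print(case_sensitive["the"], case_sensitive["The"])
# With lower(), all four occurrences land on the single key "the".
print(case_insensitive["the"])
```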

You can then output the words in order of frequency with frequencies.most_common(), or with frequencies.most_common(n) if you only want the top n.
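To make the difference between the two calls concrete, here is a small sketch with a made-up word list (not the question's file); the counts are chosen so that no two words tie:

```python
from collections import Counter

# Hypothetical word list: "a" occurs 3 times, "b" twice, "c" once.
frequencies = Counter(["a", "b", "a", "c", "a", "b"])

# All (word, count) pairs, highest count first.
everything = frequencies.most_common()
# Only the two most frequent pairs.
top_two = frequencies.most_common(2)

print(everything)
print(top_two)
```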

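If you want the resulting list sorted by frequency, with words of equal frequency in alphabetical order, you can use the sorted built-in with a key that negates the count, such as lambda pair: (-pair[1], pair[0]). So your final code is:

from collections import Counter
frequencies = Counter()
with open('gg.txt') as f:
  for line in f:
    frequencies.update(line.lower().split())
# Highest count first; words with equal counts in alphabetical order
most_frequent = sorted(frequencies.most_common(4), key=lambda pair: (-pair[1], pair[0]))
for (word, count) in most_frequent:
  print('%s %d' % (word, count))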

Which outputs:

that 11
the 11
we 10
which 10
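One caveat: most_common(4) picks its four entries by count alone, before any alphabetical sorting happens, so when several words tie at the cut-off the ones that make it in may not be the alphabetically first ones. A minimal sketch of one way around this is to sort every entry first and cut afterwards; top_n_alphabetical is a hypothetical helper name and the data is made up:

```python
from collections import Counter

def top_n_alphabetical(frequencies, n):
    """Return the n most frequent (word, count) pairs, ties broken alphabetically."""
    # Sort every entry first, then cut, so ties at the cut-off are
    # resolved alphabetically as well. Hypothetical helper, not part of Counter.
    ranked = sorted(frequencies.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]

# Hypothetical data: "zebra" and "apple" tie with two occurrences each.
frequencies = Counter(["zebra", "zebra", "apple", "apple", "mango"])
print(top_n_alphabetical(frequencies, 1))
```

The trade-off is that this sorts the whole vocabulary, which is usually fine for a single text file.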

Answer 1 (score: 1)

You can do this more simply with collections.Counter. First count the words, then sort by each word's number of occurrences and by the word itself:

from collections import Counter

# Load the file and extract the words
with open("gettysburg_address.txt") as f:
    words = [w for line in f for w in line.split()]
print('Words in text: %d' % len(words))

# Use Counter to get the counts
counts = Counter(words)

# Sort the (word, count) tuples by the count, then the word itself,
# and output the k most frequent. reverse=True puts the highest count
# first and also lists tied words in reverse alphabetical order.
k = 4
print('Frequent words:')
for w, c in sorted(counts.most_common(k), key=lambda wc: (wc[1], wc[0]), reverse=True):
    print('%s %s' % (w, c))

Output:

Words in text: 278
Frequent words:
that 13
the 9
we 8
to 8
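Note that "we" is listed before "to" here, because reverse=True reverses the word part of the key as well as the count. If tied words should appear alphabetically, as the question asks, one option is to negate the count instead of reversing the whole sort. A small sketch using the four pairs from the output above:

```python
# The four (word, count) pairs from the output above.
pairs = [("that", 13), ("the", 9), ("we", 8), ("to", 8)]

# reverse=True flips both parts of the key: counts and words both descend.
both_descending = sorted(pairs, key=lambda p: (p[1], p[0]), reverse=True)

# Negating the count keeps counts descending while words stay ascending.
ties_alphabetical = sorted(pairs, key=lambda p: (-p[1], p[0]))

print(both_descending)
print(ties_alphabetical)
```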

Answer 2 (score: 1)

Why do you keep reopening the file and creating new dictionaries? What does your code actually need to do?

create a new empty dictionary to store words {word: count}
open the file
work through each line (word) in the file
    if the word is already in the dictionary
        increment count by one
    if not
        add to dictionary with count 1
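The steps above could be sketched in Python roughly as follows. This assumes one word per line, as in the question's code; count_words is a hypothetical function name, and skipping blank lines is an added assumption:

```python
def count_words(lines):
    """Count how often each word occurs, assuming one word per line."""
    counts = {}  # {word: count}
    for line in lines:
        word = line.strip().lower()
        if not word:
            # Skip blank lines (an assumption, not in the steps above).
            continue
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

# Hypothetical input; with a real file, pass the open file object instead.
print(count_words(["that\n", "The\n", "that\n", "the\n", "we\n"]))
```

The file is opened once by the caller and a single dictionary is built, rather than reopening the file and resetting the dictionary for every word.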

Then you can easily get the number of distinct words

len(dictionary)

and the n most common words with their counts

sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:n]
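Keep in mind that len(dictionary) counts each different word once. If the "Words in text" figure is meant to be the total number of words including repeats (an assumption; the question does not say), summing the counts gives that instead. A small sketch with made-up counts that mirror the sample output in the question:

```python
# Hypothetical counts, mirroring the sample output in the question.
dictionary = {"that": 11, "the": 11, "we": 10, "which": 10}

distinct_words = len(dictionary)         # each different word counted once
total_words = sum(dictionary.values())   # every occurrence counted

print(distinct_words, total_words)
```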