Word frequency HW

Time: 2019-04-18 04:28:35

Tags: python-3.x

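Write a program that asks the user for a file name and then reads the file. The program should then determine how often each word in the file is used. Words should be counted regardless of case; for example, Spam and spam would both be counted as the same word. You should ignore punctuation. The program should then output the words and how often each word is used. The output should be sorted from the most frequent word to the least frequent.

The only problem I have is getting the code to treat "The" and "the" as the same thing. The code counts them as different words.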

userinput = input("Enter a file to open:")
if len(userinput) < 1 : userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
    lin = lin.rstrip()
    wds = lin.split()
    for w in wds:
        di[w] = di.get(w,0) + 1
# Build the (count, word) list once, after the whole file has been read
lst = list()
for k,v in di.items():
    newtup = (v, k)
    lst.append(newtup)
lst = sorted(lst, reverse=True)
print(lst)

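I need "the" and "The" to be counted as a single word.

2 answers:

Answer 0: (score: 1)

We first get the words into a list, then update the list so that all words are lowercase. You can ignore punctuation by replacing each punctuation character in the string with a space.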


punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"

for punc in punctuations:
    s = s.replace(punc,' ')

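# split() with no argument also drops the empty strings left by repeated spaces
words = s.split()
words = [word.lower() for word in words]

Then we iterate over the list and update the frequency map.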

freq = {}

for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1
print(freq)
# {'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#  'words': 2, 'are': 2, 'there': 2}
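
The steps in this answer can be wrapped into functions and extended with the sorting the assignment asks for. The following is only a sketch: the names `count_words` and `sorted_by_frequency` and the sample sentence are made up for illustration, and `str.translate` with `string.punctuation` is swapped in for the character-by-character `replace` loop.

```python
import string

def count_words(text):
    """Return a dict mapping each lowercased word to its count."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    freq = {}
    for word in text.translate(table).lower().split():
        freq[word] = freq.get(word, 0) + 1
    return freq

def sorted_by_frequency(freq):
    # Highest count first; ties are broken alphabetically.
    return sorted(freq.items(), key=lambda item: (-item[1], item[0]))

sample = "The cat saw the dog. The dog ran!"
for word, count in sorted_by_frequency(count_words(sample)):
    print(word, count)
# the 3
# dog 2
# cat 1
# ran 1
# saw 1
```

To use this with a file, read its contents (for example with `open(userinput).read()`) and pass the resulting string to `count_words`.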

Answer 1: (score: 1)

You can use Counter and re like this:

from collections import Counter
import re

sentence = 'Egg ? egg Bird, Goat  afterDoubleSpace\nnewline'

# some punctuations (you can add more here)
# raw string, so "\?" is not treated as an invalid string escape
punctuationsToBeremoved = r",|\n|\?"

#to make all of them in lower case
sentence = sentence.lower() 

#to clean up the punctuations
sentence = re.sub(punctuationsToBeremoved, " ", sentence) 

# getting the word list
words = sentence.split()

# printing the frequency of each word
print(Counter(words))
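
Printing a `Counter` directly does not guarantee the most-frequent-first order the assignment asks for, but `Counter.most_common()` returns the pairs in exactly that order. A minimal sketch along the lines of this answer, assuming it is acceptable to treat any run of word characters (`\w+`) as a word instead of listing punctuation by hand; the function names are hypothetical:

```python
from collections import Counter
import re

def most_frequent_words(text):
    # \w+ matches runs of letters, digits, and underscores, so
    # punctuation and whitespace both act as separators.
    words = re.findall(r"\w+", text.lower())
    # most_common() sorts by count, highest first.
    return Counter(words).most_common()

def most_frequent_words_in_file(path):
    # Read the whole file, then reuse the text-based helper.
    with open(path, encoding="utf-8") as f:
        return most_frequent_words(f.read())

print(most_frequent_words("Egg ? egg Bird, Goat"))
# [('egg', 2), ('bird', 1), ('goat', 1)]
```

Words with equal counts come back in the order they were first seen in the text.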