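Counting the frequency of dictionary keys in a text

Time: 2012-05-08 15:06:19

Tags: python

I have a dict of words. For each key in the dict, I want to find its frequency in an article.

After opening the article, I did: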

for k, v in sourted_key.items():
    for token in re.findall(k, data)
        token[form] += 1
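
In 're.findall(k, data)' the key must be a string, but the keys in this dict are not. I want to search for the keys. Is there another way to do this? Note that the KEYS contain a lot of PUNCTUATION.

For example, if the key is "hand", it should match only "hand", not "handy" or "chandler".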

7 answers:

Answer 0 (score: 6)

In Python 2.7+, you can use collections.Counter:

import re, collections

text = '''Nullam euismod magna et ipsum tristique suscipit. Aliquam ipsum libero, cursus et rutrum ut, suscipit id enim. Maecenas vel justo dolor. Integer id purus ante. Aliquam volutpat iaculis consectetur. Suspendisse justo sapien, tincidunt ut consequat eget, fringilla id sapien. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Praesent mattis velit vitae libero luctus posuere. Vestibulum ac erat nibh, vel egestas enim. Ut ac eros ipsum, ut mattis justo. Praesent dignissim odio vitae nisl hendrerit sodales. In non felis leo, vehicula aliquam risus. Morbi condimentum nunc sit amet enim rutrum a gravida lacus pharetra. Ut eu nisi et magna hendrerit pharetra placerat vel turpis. Curabitur nec nunc et augue tristique semper.'''

c = collections.Counter(w.lower() for w in re.findall(r'\w+|[.,:;?!]', text))
words = set(('et', 'ipsum', ',', '?'))
for w in words:
  print('%s: %d' % (w, c.get(w, 0)))

Answer 1 (score: 3)

my_text = 'abc,abc,efr,sdgret,er,ttt,'

my_dict = {'abc':0, 'er': 0}

for word in my_text.split(','):
    if word in my_dict:
        my_dict[word] += 1

Result:

>>> my_dict
{'abc': 2, 'er': 1}

Edit: a more general solution

For an ordinary article, we need to use a regular expression:

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"
my_dict = {'IS': 0, 'TRUE': 0}

words = re.findall(r'\w+', my_string)
cap_words = [word.upper() for word in words]

for word in cap_words:
    if word in my_dict:
        my_dict[word] += 1

Result:

>>> my_dict
{'IS': 2, 'TRUE': 1}

Answer 2 (score: 2)

I would do it like this:

tokens = {} 
d= {"a":1,"b":2}
data = "abca"
for k in d.keys():
    tokens[k] = data.count(k)
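
One caveat worth keeping in mind with this approach: str.count counts non-overlapping substring occurrences, so a key such as "hand" is also counted inside "handy" and "chandler", which the question wants to avoid. Here is a minimal sketch of the difference, using a made-up sample string. The names substring_counts and whole_word_counts are only for illustration, and stripping punctuation from the tokens assumes the keys themselves do not start or end with punctuation.

```python
data = "hand, handy chandler hand"
d = {"hand": 0}

# str.count counts substrings, so "handy" and "chandler" are included.
substring_counts = {k: data.count(k) for k in d}

# Comparing against whitespace-separated tokens (with surrounding
# punctuation stripped) counts only whole words.
tokens = [t.strip(".,;:!?") for t in data.split()]
whole_word_counts = {k: tokens.count(k) for k in d}

print(substring_counts)   # {'hand': 4}
print(whole_word_counts)  # {'hand': 2}
```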

Answer 3 (score: 1)

Try re.findall(re.escape(k), data) to make sure that special characters in the "words" don't cause problems.

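But my guess is that this is not your problem. re.findall() returns a list of matched strings, not match objects (re.finditer() is the one that yields match objects). So token is a plain str, and token[form] += 1 cannot work.

You probably meant counts[token] += 1, where counts is a dictionary with a default value of 0. With re.finditer() it would be counts[token.group()] += 1.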
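
To tie this back to the question, here is a minimal sketch (not the asker's code) that escapes each key and counts it only when it is not attached to other word characters. The names count_keys, keys and data are made up for illustration. The lookarounds (?<!\w) and (?!\w) are just one possible definition of "whole word"; they are used instead of \b because \b does not behave as expected next to keys that begin or end with punctuation, such as "C++".

```python
import re

def count_keys(keys, data):
    """Count whole-token occurrences of each key in data."""
    counts = {}
    for k in keys:
        # re.escape neutralises regex metacharacters such as '.' or '+'.
        # The lookarounds reject matches adjacent to a word character,
        # so 'hand' is not counted inside 'handy' or 'chandler'.
        pattern = r'(?<!\w)' + re.escape(k) + r'(?!\w)'
        counts[k] = len(re.findall(pattern, data))
    return counts

data = "A hand, a handy chandler, a C++ book and another hand. Is e.g. fine?"
result = count_keys(["hand", "C++", "e.g."], data)
print(result)
```

This scans the text once per key, which is probably fine for a modest dictionary; for a very large one, tokenizing the text once and counting the tokens would likely be faster.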

Answer 4 (score: 1)

Option A

import re

text = """Now is the time for all good men to come to the aid of their country.  A man is only as good as all his thoughts."""
words = dict()

for word in re.findall('[^ .;]+', text):
    if words.get(word.lower(), False):
        words[word.lower()] += 1
    else:
        words[word.lower()] = 1

print(words)

This yields...

{'a': 1, 'all': 2, 'good': 2, 'for': 1, 'their': 1, 'of': 1, 
'is': 2, 'men': 1, 'as': 2, 'country': 1, 'to': 2, 'only': 1, 
'his': 1, 'time': 1, 'aid': 1, 'the': 2, 'now': 1, 'come': 1, 
'thoughts': 1, 'man': 1}

Option B: using defaultdict

import re
from collections import defaultdict

text = """Now is the time for all good men to come to the aid of their country.  A man is only as good as all his thoughts."""
words = defaultdict(int)

for word in re.findall('[^ .;]+', text):
    words[word.lower()] += 1

print(words)

This results in...

defaultdict(<type 'int'>, {'a': 1, 'all': 2, 'good': 2, 'for': 1, 
'their': 1, 'of': 1, 'is': 2, 'men': 1, 'as': 2, 'country': 1, 'to': 2, 
'only': 1, 'his': 1, 'time': 1, 'aid': 1, 'the': 2, 'now': 1, 'come': 1, 
'thoughts': 1, 'man': 1})

Answer 5 (score: 0)

article = "I have a dict of words. For each key in the dict, I want to find its frequency in an article"

words = {"dict", "i", "in", "key"} # set of words


wordsFreq = {}

wordsInArticle = tuple(word.lower() for word in article.split(" "))

for word in wordsInArticle:
  if word in words:  # count only the words we are interested in
    wordsFreq[word] = wordsFreq.get(word, 0) + 1

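Answer 6 (score: 0)

Since everyone is taking a swing at this...

The difference with this one is the regex that separates the text from the punctuation. I use \b\w+\b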

import re 

article='''Richard II (13671400) was King of England, a member of the House of Plantagenet and the last of its main-line kings. He ruled from 1377 until he was deposed in 1399. Richard was a son of Edward, the Black Prince, and was born during the reign of his grandfather, Edward III. Richard was tall, good-looking and intelligent. Although probably not insane, as earlier historians believed, he may have suffered from one or several personality disorders that may have become more apparent toward the end of his reign. Less of a warrior than either his father or grandfather, he sought to bring an end to the Hundred Years' War that Edward III had started. He was a firm believer in the royal prerogative, which led him to restrain the power of his nobility and rely on a private retinue for military protection instead. He also cultivated a courtly atmosphere where the king was an elevated figure, and art and culture were at the centre, in contrast to the fraternal, martial court of his grandfather. Richard's posthumous reputation has to a large extent been shaped by Shakespeare, whose play Richard II portrays Richard's misrule and Bolingbroke's deposition as responsible for the 15th-century Wars of the Roses. Most authorities agree that the way in which he carried his policies out was unacceptable to the political establishment, and this led to his downfall.'''
words = {}

for word in re.findall(r'\b\w+\b', article):
    word=word.lower()
    if word in words:
        words[word]+=1
    else:
        words[word]=1    

print([(k,v) for v, k in sorted(((v, k) for k, v in words.items()), reverse=True)])

This prints a list of (word, count) tuples sorted by frequency:

[('the', 15), ('of', 11), ('was', 8), ('and', 8), ('to', 7), ('his', 7), ('he', 7), 
 ('a', 7), ('richard', 6), ('in', 4), ('that', 3), ('s', 3), ('grandfather', 3), 
 ('edward', 3), ('which', 2), ('reign', 2), ('or', 2), ('may', 2), ('led', 2), 
 ('king', 2), ('iii', 2), ('ii', 2), ('have', 2), ('from', 2), ('for', 2), ('end', 2), 
 ('as', 2), ('an', 2), ('years', 1), ('whose', 1), ('where', 1), ('were', 1), ('way', 1), ('wars', 1), ('warrior', 1), ('war', 1), ('until', 1), ('unacceptable', 1), ('toward', 1), ('this', 1), ('than', 1), ('tall', 1), ('suffered', 1), ('started', 1), ('sought', 1), ('son', 1), ('shaped', 1), ('shakespeare', 1), ('several', 1), ('ruled', 1), ('royal', 1), ('roses', 1), ('retinue', 1), ('restrain', 1), ('responsible', 1), ('reputation', 1), ('rely', 1), ('protection', 1), ('probably', 1), ('private', 1), ('prince', 1), ('prerogative', 1), ('power', 1), ('posthumous', 1), ('portrays', 1), ('political', 1), ('policies', 1), ('play', 1), ('plantagenet', 1), ('personality', 1), ('out', 1), ('one', 1), ('on', 1), ('not', 1), ('nobility', 1), ('most', 1), ('more', 1), ('misrule', 1), ('military', 1), ('member', 1), ('martial', 1), ('main', 1), ('looking', 1), ('line', 1), ('less', 1), ('last', 1), ('large', 1), ('kings', 1), ('its', 1), ('intelligent', 1), ('instead', 1), ('insane', 1), ('hundred', 1), ('house', 1), ('historians', 1), ('him', 1), ('has', 1), ('had', 1), ('good', 1), ('fraternal', 1), ('firm', 1), ('figure', 1), ('father', 1), ('extent', 1), ('establishment', 1), ('england', 1), ('elevated', 1), ('either', 1), ('earlier', 1), ('during', 1), ('downfall', 1), ('disorders', 1), ('deposition', 1), ('deposed', 1), ('culture', 1), ('cultivated', 1), ('courtly', 1), ('court', 1), ('contrast', 1), ('century', 1), ('centre', 1), ('carried', 1), ('by', 1), ('bring', 1), ('born', 1), ('bolingbroke', 1), ('black', 1), ('believer', 1), ('believed', 1), ('been', 1), ('become', 1), ('authorities', 1), ('atmosphere', 1), ('at', 1), ('art', 1), ('apparent', 1), ('although', 1), ('also', 1), ('agree', 1), ('15th', 1), ('1399', 1), ('1377', 1), ('13671400', 1)]
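
As an alternative to the manual counting and sorting above, the same kind of frequency-sorted listing can probably be had from collections.Counter (Python 2.7+). This is a rough sketch on a short made-up sentence rather than the article above. Note that the order of words with equal counts may differ from the sorted() version, which breaks ties in reverse alphabetical order.

```python
import re
from collections import Counter

article = "The king was tall. The king ruled, and the court was the king's court."
words = Counter(w.lower() for w in re.findall(r'\b\w+\b', article))

# most_common(n) returns the n most frequent (word, count) pairs,
# highest count first.
top = words.most_common(2)
print(top)  # [('the', 4), ('king', 3)]
```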