Populating a dictionary in Python

Time: 2014-09-16 04:02:31

Tags: python dictionary

I have to store the count of each word across multiple files. In Perl I used a hash, e.g. $wcCount{$file}{$word}. I cannot figure out how to do something similar in Python. I tried something to this effect, but obviously to no avail:

for line in fh:
    arr = line.split()
    for word in arr:
        key = filename + word  #creates a unique identifier for each word count
        freqdict[key] += 1

I read another similar Stack Overflow question, but its approach did not allow the value to be updated when the same word was counted again.

The input is the words of multiple files (given as command-line arguments). The output should just be a frequency list of the words for each file.

5 Answers:

Answer 0: (score: 2)

Suppose you have Hamlet and you want to count the unique words.

You could do something like this:

# the tools we need, read a url and regex library 
import urllib2
import re

# a dict -- similar to Perl hash
words={}

# read the text at that url
response = urllib2.urlopen('http://pastebin.com/raw.php?i=7p3uycAz')
hamlet = response.read()

# split on whitespace, remove trailing punctuation, and count each unique word
for word in hamlet.split():
    word=re.sub(r'\W+$', r'', word)
    if word.strip(): 
        words[word]=words.setdefault(word, 0) +1

Then, if you want to print the words from most common to least common:

# sort descending on count, ascending on ascii lower case
for word, count in sorted(words.items(), key=lambda t: (-t[1], t[0].lower())):
    print word, count  

Prints:

the 988
and 702
of 628
to 610
I 541
you 495
a 452
my 441
in 399
HAMLET 385
it 360
is 313
...

If you want a dict of nested dicts (as your Perl example suggests), you might do it like this:

# think of these strings like files; the letters like words
str1='abcdefaaa'
str2='abefdd'
str3='defeee'

letters={}

for fn, st in (('string 1', str1), ('string 2', str2) , ('string 3', str3)):
    letters[fn]={}
    for c in st:
        letters[fn][c]=letters[fn].setdefault(c, 0)
        letters[fn][c]+=1

print letters     
# {'string 3': {'e': 4, 'd': 1, 'f': 1}, 
#  'string 1': {'a': 4, 'c': 1, 'b': 1, 'e': 1, 'd': 1, 'f': 1}, 
#  'string 2': {'a': 1, 'b': 1, 'e': 1, 'd': 2, 'f': 1}}
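
Looking up a single count then works just like the Perl hash access:

print letters['string 1']['a']   # prints 4 -- 'a' occurs four times in str1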

Answer 1: (score: 1)

You could use a Counter with a tuple of (filename, word) as the key, e.g.:

from collections import Counter
from itertools import chain

word_counts = Counter()
for filename in ['your', 'file names', 'here']:
    with open(filename) as fin:
        words = chain.from_iterable(line.split() for line in fin)
        word_counts.update((filename, word) for word in words)

However, what you can also do is create an initial dict keyed on file name, with a Counter for each file, and then update those Counters. That way you can access the "hash", as it were, with the file name as the key and then get the word counts, e.g.:

word_counts = {filename: Counter() for filename in your_filenames}
for filename, counter in word_counts.items():
    with open(filename) as fin:
        words = chain.from_iterable(line.split() for line in fin)
        word_counts[filename].update(words)
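
With this layout a lookup then mirrors Perl's $wcCount{$file}{$word}; for example (the file name and word here are just placeholders):

print(word_counts['some_file.txt']['the'])   # count of "the" in that (placeholder) file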

Answer 2: (score: 0)

If you are using Python 2.7 or newer, I would suggest collections.Counter:

import collections

counter = collections.Counter()

for line in fh:
    arr = line.split()
    for word in arr:
        key = filename + word  #creates a unique identifier for each word count
        counter.update((key,))

You can then inspect the counts like this:

for key, value in counter.items():
    print('{0}: {1}'.format(key, value))
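
Counter also provides most_common() if you would rather see the counts ordered from most to least frequent:

for key, value in counter.most_common(10):   # the ten most frequent keys
    print('{0}: {1}'.format(key, value))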

Answer 3: (score: 0)

I am not a Perl programmer, but I believe the following solution in Python gets you closest to $wcCount{$file}{$word} in Perl:

from collections import Counter
from itertools import chain

def count_words(filename):
    with open(filename, 'r') as f:
        word_iter = chain.from_iterable(line.split() for line in f)
        return Counter(word_iter)

word_counts = {file_name : count_words(file_name) for file_name in file_names}
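
A lookup then reads almost exactly like the Perl version; the file name and word below are only placeholders:

print(word_counts['report.txt']['the'])   # like $wcCount{'report.txt'}{'the'}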

Answer 4: (score: 0)

Alternatively, you could benefit from getting to know nltk (the Natural Language Toolkit). If you end up doing more than just word frequencies, it can help a great deal.

Here the text is parsed into sentences and then into words:

import nltk
import urllib2

hamlet = urllib2.urlopen('http://pastebin.com/raw.php?i=7p3uycAz').read().lower()

word_freq = nltk.FreqDist()
for sentence in nltk.sent_tokenize(hamlet):
    for word in nltk.word_tokenize(sentence): 
        word_freq[word] += 1

word_freq:

FreqDist({',': 3269, '.': 1283, 'the': 1138, 'and': 965, 'to': 737, 'of': 669, 'i': 629, ';': 582, 'you': 553, ':': 535, ...})
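
Since FreqDist behaves like a Counter (at least in NLTK 3.x, where it is Counter-based), you can also ask for the most frequent tokens directly:

print(word_freq.most_common(10))   # the ten most frequent tokens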