Traverse a directory, count the words in all files and subdirectories, and accumulate a grand total

Date: 2018-02-28 00:59:39

Tags: python word-count os.walk

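Hello Stack Overflow community! For years I have relied on this community for small projects for work, school and personal exploration; however, this is the first question I have posted... so be gentle ;)

I am trying to read every file in a directory and all of its subdirectories, then accumulate the results into a single dictionary using Python. Right now the script (see below) reads all the files as intended, but the results are reported separately for each file. I am looking for help accumulating them into one.

Code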

import re
import os
import sys
import os.path
import fnmatch
import collections

def search(path):

    if os.path.isdir(path) == True:
        for root, dirs, files in os.walk(path):
            for file in files:
              #  words = re.findall('\w+', open(file).read().lower())
                words = re.findall('\w+', open(os.path.join(root, file)).read().lower())
                ignore = ['the','a','if','in','it','of','or','on','and','to']
                counter=collections.Counter(x for x in words if x not in ignore)
                print(counter.most_common(10))

    else:
        words = re.findall('\w+', open(path).read().lower())
        ignore = ['the','a','if','in','it','of','or','on','and','to']
        counter=collections.Counter(x for x in words if x not in ignore)
        print(counter.most_common(10))

path = raw_input("Enter file and path")
search(path)

Result

Enter file and path./dirTest

[('this', 1), ('test', 1), ('is', 1), ('just', 1)]

[('this', 1), ('test', 1), ('is', 1), ('just', 1)]

[('test', 2), ('is', 2), ('just', 2), ('this', 1), ('really', 1)]

[('test', 3), ('just', 2), ('this', 2), ('is', 2), ('power', 1),
('through', 1), ('really', 1)]

[('this', 2), ('another', 1), ('is', 1), ('read', 1), ('can', 1),
('file', 1), ('test', 1), ('you', 1)]

Desired result (example)

[('this', 5), ('another', 1), ('is', 5), ('read', 1), ('can', 1),
('file', 1), ('test', 5), ('you', 1), ('power', 1), ('through', 1),
('really', 2)]

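Any guidance is greatly appreciated!

3 answers:

Answer 0: (score: 0)

The problem lies in your print statement and in how you use the Counter object. I suggest the following.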

import collections
import os
import re

ignore = ['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to']

def extract(file_path, counter):
    # Read one file and add its words to the shared counter
    with open(file_path) as f:
        words = re.findall(r'\w+', f.read().lower())
    counter.update(x for x in words if x not in ignore)

def search(path):
    # One Counter for the whole run, so counts accumulate across files
    counter = collections.Counter()

    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                extract(os.path.join(root, file), counter)
    else:
        extract(path, counter)

    # Print once, after every file has been processed
    print(counter.most_common(10))

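You can factor the shared lines out into a helper. Also, os.path.isdir(path) already returns a bool, so you can use it directly as the if condition without comparing it to True.

Initial solution: my first version appended all of your words to a single list and then passed that list to Counter. That way you get one output for all of your results.

Given the performance impact that @ShadowRanger pointed out, you can update the counter directly instead of building a separate list.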
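To illustrate the point about updating the counter directly, here is a minimal Python 3 sketch. The two strings in `documents` are hypothetical stand-ins for file contents and the stop-word set is shortened for the example; neither comes from the original question.

```python
import collections

# Hypothetical, shortened stop-word set for this sketch
ignore = {'the', 'a', 'of'}
counter = collections.Counter()

# Two in-memory strings stand in for the contents of two files
documents = ["this is a test", "this is the power of test"]

for text in documents:
    # update() adds to the existing counts instead of replacing them,
    # and it accepts a generator, so no intermediate list is built
    counter.update(w for w in text.split() if w not in ignore)

print(counter.most_common(3))
```

Because `update()` consumes any iterable, the words from each file can be streamed into the same `Counter` without first collecting them all in memory.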

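Answer 1: (score: 0)

You want a single Counter that holds all the accumulated statistics and is printed at the end, but you are creating a Counter for each file, printing it, and then throwing it away. You just need to move the Counter initialization and the print outside your loop, and have each file update the "one true Counter":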

def search(path):
    # Initialize empty Counter up front
    counter = collections.Counter()
    # Create ignore only once, and make it a set, so membership tests go faster
    ignore = {'the','a','if','in','it','of','or','on','and','to'}
    if os.path.isdir(path):  # Comparing to True is anti-pattern; removed
        for root, dirs, files in os.walk(path):
            for file in files:
                words = re.findall(r'\w+', open(os.path.join(root, file)).read().lower())
                # Update common Counter
                counter.update(x for x in words if x not in ignore)

    else:
        words = re.findall(r'\w+', open(path).read().lower())
        # Update common Counter
        counter.update(x for x in words if x not in ignore)
    # Do a single print at the end
    print(counter.most_common(10))

If you like, you can factor out the common code here, e.g.:

def update_counts_for_file(path, counter, ignore=()):
    with open(path) as f:  # Using with statements is good, always do it
        words = re.findall(r'\w+', f.read().lower())
    counter.update(x for x in words if x not in ignore)

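This lets you replace the repeated code with calls to the factored-out function, but unless the code gets more complicated, it is probably not worth it for two lines that are repeated twice.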
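For reference, the factored-out helper could be combined with the directory walk roughly as follows. This is only a sketch in Python 3: the name `count_words`, the choice to return the Counter rather than print it, and the `encoding`/`errors` arguments are assumptions added for illustration, not part of the answer above.

```python
import collections
import os
import re

def update_counts_for_file(path, counter, ignore=()):
    # encoding/errors are assumptions of this sketch, so that a file with
    # unexpected bytes does not abort the whole walk
    with open(path, encoding='utf-8', errors='ignore') as f:
        words = re.findall(r'\w+', f.read().lower())
    counter.update(w for w in words if w not in ignore)

def count_words(path, ignore=()):
    """Return one Counter for a single file or for a whole directory tree."""
    counter = collections.Counter()
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for name in files:
                update_counts_for_file(os.path.join(root, name), counter, ignore)
    else:
        update_counts_for_file(path, counter, ignore)
    return counter
```

A call such as `count_words('./dirTest', ignore={'the', 'a'}).most_common(10)` would then give the accumulated top ten. Returning the Counter instead of printing it makes the function easier to reuse and to test.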

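Answer 2: (score: -1)

I see that you are trying to find certain keywords from a file/directory scan and get the number of occurrences.

Basically, you can collect all of those occurrences into a list and then find the count for each one.

def count_all(array):
    # Deduplicate, then count each distinct word in the original list
    nodup = list(set(array))
    for i in nodup:
        print((i, array.count(i)))

array = ['this','this','this','is','is']
count_all(array)

Output (order may vary, since sets are unordered):
('this', 3)
('is', 2)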
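Note that `list.count` rescans the whole list once for every distinct word, so this approach becomes slow on large inputs. As a comparison, a minimal Python 3 sketch using `collections.Counter`, which tallies the same list in a single pass:

```python
import collections

array = ['this', 'this', 'this', 'is', 'is']

# Counter tallies every element in one pass over the list
counts = collections.Counter(array)

# most_common() with no argument yields all items, highest count first
for word, n in counts.most_common():
    print((word, n))
```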