Counting specific words in multiple text files

Time: 2021-06-21 13:36:45

Tags: python

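I have multiple text files, and I need to find and count specific words in them and write the results to a CSV file. Column A should contain the .txt file names, the header row should contain the words, and each row should hold the counts for that file. With this code I get counts for every word, and I need to filter them down to just the exact words in the list.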

For example, the output should look like the image I uploaded.

header = ['Abuse', 'Accommodating', 'Accommodation', 'Accountability']

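import csv
import os
import re
from collections import Counter
from glob import glob

folderpaths = 'C:/Users/haris/Downloads/PDF/'
counter = Counter()
filepaths = glob(os.path.join(folderpaths, '*.txt'))
for file in filepaths:
    with open(file) as f:
        # lowercase the text and split it into word tokens
        words = re.findall(r'\w+', f.read().lower())
        counter.update(words)
print(counter)

# this writes every word found in any file, not just the words in header
with open('C:/Users/haris/Downloads/PDF/firstcsv.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in counter.items():
        writer.writerow(row)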


Files uploaded to google drive

2 answers:

Answer 0 (score: 1)

Edit: following your new requirement, I added a "total_words" column. The code has been updated.



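Below is working code. Just change the "folderpath" variable to the path of the folder containing your text files, and the "target_file" variable to the location where the output CSV file should be created.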

[Image: sample CSV output]

Code:

from collections import Counter
import csv
import glob
import os
import re

# words to count; they are lowercase because the file text is lowercased below
header = ['annual', 'investment', 'statement', 'range', 'deposit', 'supercalifragilisticexpialidocious']
folderpath = r'C:\Users\USERname4\Desktop\myfolder'
target_file = r'C:\Users\USERname4\Desktop\mycsv.csv'

filepaths = glob.glob(os.path.join(folderpath, '*.txt'))

# newline='' lets the csv module control the line endings
with open(target_file, 'w', newline='') as mycsvfile:
    writer = csv.writer(mycsvfile)
    writer.writerow(['file_name', 'total_words'] + header)
    for file in filepaths:
        with open(file) as f:
            words = re.findall(r'\w+', f.read().lower())
        counter = Counter(words)
        # len(counter) is the number of distinct words in the file;
        # use len(words) instead to count repeated words as well
        row = [os.path.basename(file), len(counter)]
        # a Counter returns 0 for words that never occur
        row += [counter[word] for word in header]
        writer.writerow(row)
        print(row)
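
The per-file counting step can also be tried on a plain string before running it over a whole folder. The following is only a minimal sketch: the helper name count_target_words and its return shape are hypothetical, the sample sentence is made up, and it assumes "total_words" should count every token including repeats.

```python
import re
from collections import Counter


def count_target_words(text, targets):
    """Return (total_words, counts) for the text of one document.

    Hypothetical helper; the name and return shape are illustrative only.
    """
    # \w+ extracts whole tokens, so 'range' is not matched inside 'arrange'
    words = re.findall(r'\w+', text.lower())
    counter = Counter(words)
    # A Counter returns 0 for missing keys, so absent targets need no special case
    counts = {target: counter[target.lower()] for target in targets}
    return len(words), counts


# Made-up sample text standing in for the contents of one .txt file
sample = "Annual range; arrange the annual statement."
total, counts = count_target_words(sample, ['annual', 'range', 'deposit'])
print(total, counts)
```

Because matching is done on whole tokens, "arrange" does not add to the count for "range", which is the exact-word behavior the question asks for.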

Answer 1 (score: 0)

Using "Counter" here seems like the right choice, but I think you are using it incorrectly.

Here is a solution that may work for you:

import csv
from collections import Counter

words = ['Abuse', 'Accommodating', 'Accommodation', 'Accountability']

# filepaths is the list of .txt paths built with glob in the question's code
rows = []
for file in filepaths:
    with open(file, 'r') as f:
        words_in_file = [word for line in f for word in line.split()]
    # this will count all the words in the file (not optimal)
    wordcounts = Counter(words_in_file)
    # interested only in specific words; the lookup is case-sensitive,
    # and a Counter returns 0 for words that never occur
    counts = [wordcounts[word] for word in words]
    # insert first column (filename)
    counts.insert(0, file)
    # append it to the rest of the rows
    rows.append(counts)

# newline='' avoids blank lines between rows on Windows
with open('C:/Users/haris/Downloads/PDF/firstcsv.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
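
One caveat with splitting on whitespace is that tokens keep their case and punctuation, so "Abuse" in the word list would not match "abuse," in a file. A possible variation, sketched here under assumptions, lowercases both sides and also writes a header row. The build_row helper, the in-memory sample texts, and the "file_name" column label are all hypothetical, and an io.StringIO buffer stands in for the real CSV file.

```python
import csv
import io
import re
from collections import Counter

words = ['Abuse', 'Accommodating', 'Accommodation', 'Accountability']


def build_row(name, text, targets):
    # Hypothetical helper: lowercase both sides so 'Abuse' matches 'abuse,'
    tokens = re.findall(r'\w+', text.lower())
    wordcounts = Counter(tokens)
    return [name] + [wordcounts[target.lower()] for target in targets]


# Made-up sample texts standing in for the real .txt files
samples = {
    'a.txt': 'Abuse of power. The abuse, and accountability.',
    'b.txt': 'An accommodating policy.',
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
# Header row keeps the original capitalization of the word list
writer.writerow(['file_name'] + words)
for name, text in samples.items():
    writer.writerow(build_row(name, text, words))
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be replaced with a file opened with newline='', as the csv module documentation recommends.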