How do I list the files in a directory and process them one by one? - Python

Time: 2018-09-20 07:30:51

Tags: python text

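I want to list all the text files in a directory and then create a separate list of the contents of each file, for example document1 = [], then document2 = [], and so on. I then want to use the names document1 and document2 to compute word frequencies and do other processing. The code runs, but I cannot assign different names such as document1 to the lists.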

import glob
import math
import re

a = 0  # number of documents processed
# get all the files from the directory
flist = glob.glob(r'D:/Final Year Project/Development process/Text_data_extraction/MyFolder/*.txt')
# open each file >> tokenize the content >> and store it in a set
for fname in flist:
    tfile = open(fname, "r")
    line = tfile.read()
    tfile.close()  # close the file
    a += 1
    line = line.lower()  # lowercase
    line = re.sub("</?.*?>", " <> ", line)  # remove tags
    line = re.sub("(\\d|\\W)+", " ", line)  # remove special characters and digits
    l_ist = line.split("\n")
    print('document')
    print(l_ist)
print("Number of documents:")
print(a)

2 Answers:

Answer 0 (score: 0)

You can assign the list created in each iteration to a dictionary keyed by file name:

import glob
import math
import re

# get all the files from the directory
flist = glob.glob(r'D:/Final Year Project/Development process/Text_data_extraction/MyFolder/*.txt')
content = {}  # maps each file name to its list of tokens
for fname in flist:
    # open each file >> tokenize the content >> and store it in the dict
    with open(fname, "r") as tfile:  # the file is closed automatically
        line = tfile.read()
    line = line.lower()  # lowercase
    line = re.sub("</?.*?>", " <> ", line)  # remove tags
    line = re.sub("(\\d|\\W)+", " ", line)  # remove special characters and digits
    # split on whitespace: the substitution above has already replaced newlines,
    # so split("\n") would return the whole text as a single item
    l_ist = line.split()
    print('document')
    print(l_ist)
    content[fname] = l_ist
print("Number of documents:")
print(len(content))
print(content)  # to verify the content of the entire dict
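
Since the question also mentions word frequencies, here is a minimal sketch of one way to count them once the token lists are stored in a dict like the one above. The helper name term_frequencies and the sample data are hypothetical and only stand in for the dict built from the real files; collections.Counter is part of the standard library.

```python
from collections import Counter


def term_frequencies(content):
    """Return {document name: Counter of token counts} for a dict of token lists."""
    return {name: Counter(tokens) for name, tokens in content.items()}


# Hypothetical sample data standing in for the dict built from the files.
sample = {
    "a.txt": ["the", "cat", "sat", "the"],
    "b.txt": ["dog"],
}
freqs = term_frequencies(sample)
print(freqs["a.txt"].most_common(1))  # prints [('the', 2)]
```

A Counter returns 0 for words it has never seen, so lookups such as freqs["b.txt"]["cat"] do not raise a KeyError.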

Answer 1 (score: 0)

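See here. I believe that instead of giving the names of the individual text files, you should give the directory path together with the naming pattern. For the names "document1, document2 ...", generate them in a loop, or write them out directly if the number of document files is fixed.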
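
The loop idea can be sketched roughly as follows. This is only an illustration under assumptions: the function names tokenize and load_documents are made up for this example, "MyFolder" is a placeholder path, and the keys are numbered in sorted file-name order. Unlike the code in the question, tags are replaced by a plain space rather than " <> ", which gives the same tokens because the second substitution removes the angle brackets anyway.

```python
import glob
import os
import re


def tokenize(text):
    """Lowercase, drop tags, digits and punctuation, then split into words."""
    text = text.lower()
    text = re.sub(r"</?.*?>", " ", text)   # remove tags
    text = re.sub(r"(\d|\W)+", " ", text)  # remove digits and special characters
    return text.split()


def load_documents(folder):
    """Return {'document1': [...], 'document2': [...], ...} for the .txt files in folder."""
    documents = {}
    paths = sorted(glob.glob(os.path.join(folder, "*.txt")))
    for number, path in enumerate(paths, start=1):
        with open(path, "r") as tfile:
            documents["document%d" % number] = tokenize(tfile.read())
    return documents


documents = load_documents("MyFolder")  # "MyFolder" is a placeholder path
print("Number of documents:", len(documents))
```

Keeping the lists in one dict means documents["document1"] works for any number of files, whereas separate variables named document1, document2 and so on would have to be created by hand.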