我花了3分钟多的时间遍历了一个4gb的文本文件,计算了行数,每行的单词数和字符数。有更快的方法吗?
这是我的代码:
import time
import csv
import sys
csv.field_size_limit(sys.maxsize)
i=0
countwords={}
countchars={}
start=time.time()
with open("filename.txt", "r", encoding="utf-8") as file:
for line in csv.reader(file, delimiter="\t"):
i+=1
countwords[i]=len(str(line).split())
countchars[i]=len(str(line))
if i%10000==0:
print(i)
end=time.time()
if i>0:
print(i)
print(sum(countwords.values())/i)
print(sum(countchars.values())/i)
print(end-start)
答案 0 :(得分:1)
经过有限测试(在Unix词典上),使用numpy只能获得较小的加速,但是任何胜利都是胜利。我不确定使用csvreader
是解析制表符分隔的文本的好方法,但是我还没有检查这是否可以提供最佳速度。
import time
import numpy
# Holds count of words and letters per line of input
countwords = numpy.array( [] )
countchars = numpy.array( [] )
# Holds total count of words and letters per file
word_sum = 0
char_sum = 0
start = time.time()
file_in = open( "filename.txt", "rt", encoding="utf-8" )
for line in file_in:
# cleanup the line, split it into fields by TAB character
line = line.strip()
fields = line.split( '\t' )
# Count the fields, and the letters of each field's content
field_count = len( fields )
char_count = len( line ) - field_count # don't count the '\t' chars too
# keep a separate count of the fields and letters by line
numpy.append( countwords, field_count )
numpy.append( countchars, char_count )
# Keep a running total to save summation at the end
word_sum += field_count
char_sum += char_count
file_in.close()
end = time.time()
print("Total Words: %3d" % ( word_sum ) )
print("Total Letters: %3d" % ( char_sum ) )
print("Elapsed Time: %.2f" % ( end-start ) )
答案 1 :(得分:0)
您可以避免分配额外的数据,而使用列表代替字典:
import time
import csv
import sys
csv.field_size_limit(sys.maxsize)
countwords=0
countchars=0
start=time.time()
with open("filename.txt", "r", encoding="utf-8") as file:
for i, line in enumerate(csv.reader(file, delimiter="\t")):
words = str(line).split() #we allocate just 1 extra string
wordsLen = len(words)
countwords += wordsLen
# for avoiding posible allocation we iterate throug the chars of the words
# we already have, then we need to add the spaces in between which is
# wordsLen - 1
countchars += len(itertools.chain.from_iterable(words)) + wordsLen - 1)
if i%10000==0:
print(i)
end=time.time()
if i>0:
print(i)
print(countwords/i)
print(countchars/i)
print(end-start)
答案 2 :(得分:0)
我设法编写了另一个版本的快速代码(使用我在不同线程中看到的想法),但是与使用numpy的Kingsley代码相比,它目前有一个缺点,因为它不会每行保存数据,而只会保存数据数据。无论如何,这里是:
import time
start=time.time()
f = open("filename.txt", 'rb')
lines = 0
charcount=0
wordcount=0
#i=10000
buf_size = 1024 * 1024
read_f = f.raw.read
buf = read_f(buf_size)
while buf:
lines += buf.count(b'\t')
'''while lines/i>1:
print(i)
i+=10000'''
charcount+=len((buf.strip()))
wordcount+=len((buf.strip()).split())
buf = read_f(buf_size)
end=time.time()
print(end-start)
print(lines)
print(charcount/lines)
print(wordcount/lines)