How to speed up looping through a 4GB tab-delimited text file

Time: 2019-04-22 23:47:53

Tags: python-3.x

It took me more than 3 minutes to loop through a 4 GB text file, counting the number of lines and the words and characters per line. Is there a faster way?

Here is my code:

import time
import csv
import sys
csv.field_size_limit(sys.maxsize)
i=0
countwords={}   # words per line, keyed by line number
countchars={}   # characters per line, keyed by line number
start=time.time()
with open("filename.txt", "r", encoding="utf-8") as file:
    for line in csv.reader(file, delimiter="\t"):
        i+=1
        countwords[i]=len(str(line).split())
        countchars[i]=len(str(line))
        if i%10000==0:
            print(i)
end=time.time()
if i>0:
    print(i)
    print(sum(countwords.values())/i)
    print(sum(countchars.values())/i)
    print(end-start)

3 Answers:

Answer 0 (score: 1)

In limited testing (on the Unix dictionary file), using numpy gave only a small speedup, but any win is a win. I'm not sure csv.reader is a good way to parse tab-delimited text, though I haven't checked whether it would give the best speed.

import time
import numpy

# Holds count of words and letters per line of input
countwords = numpy.array( [] )
countchars = numpy.array( [] )

# Holds total count of words and letters per file
word_sum = 0
char_sum = 0

start = time.time()

file_in = open( "filename.txt", "rt", encoding="utf-8" )
for line in file_in:
    # cleanup the line, split it into fields by TAB character
    line   = line.strip()
    fields = line.split( '\t' )

    # Count the fields, and the letters of each field's content
    field_count = len( fields )
    char_count  = len( line ) - (field_count - 1)   # don't count the '\t' separators (there are field_count - 1 of them)

    # keep a separate count of the fields and letters by line
    # numpy.append returns a new array, so the result must be assigned back
    countwords = numpy.append( countwords, field_count )
    countchars = numpy.append( countchars, char_count )

    # Keep a running total to save summation at the end
    word_sum += field_count
    char_sum += char_count

file_in.close()

end = time.time()

print("Total Words:   %3d"  % ( word_sum ) )
print("Total Letters: %3d"  % ( char_sum ) )
print("Elapsed Time:  %.2f" % ( end-start ) )

Answer 1 (score: 0)

You can avoid allocating the extra per-line data and keep running totals instead of dictionaries:

import time
import csv
import sys
csv.field_size_limit(sys.maxsize)
countwords=0
countchars=0
start=time.time()
with open("filename.txt", "r", encoding="utf-8") as file:
    for i, line in enumerate(csv.reader(file, delimiter="\t")):
        words = str(line).split() #we allocate just 1 extra string
        wordsLen = len(words)
        countwords += wordsLen
        # for avoiding posible allocation we iterate throug the chars of the words
        # we already have, then we need to add the spaces in between which is 
        # wordsLen - 1
        countchars += len(itertools.chain.from_iterable(words)) + wordsLen - 1)
        if i%10000==0:
            print(i)
end=time.time()
if i>0:
    print(i)
    print(countwords/i)
    print(countchars/i)
    print(end-start)
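
Note that both this answer and the question call str(line) on the csv row, which is a Python list, so the counts include the brackets, quotes, and commas of the list's repr. If the goal is to count the actual words and characters inside the fields, a minimal sketch of the same running-total idea (assuming the same tab-delimited filename.txt) could work on the fields directly:

import csv
import sys
import time

csv.field_size_limit(sys.maxsize)
countwords = 0
countchars = 0
lines = 0
start = time.time()
with open("filename.txt", "r", encoding="utf-8") as file:
    for row in csv.reader(file, delimiter="\t"):
        lines += 1
        # count words and characters of the field contents themselves
        countwords += sum(len(field.split()) for field in row)
        countchars += sum(len(field) for field in row)
end = time.time()
if lines > 0:
    print(lines, countwords / lines, countchars / lines, end - start)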

Answer 2 (score: 0)

I managed to write another fast version (using ideas I had seen in a different thread), but compared to Kingsley's numpy code it currently has the drawback that it does not keep per-line data, only the totals. Anyway, here it is:

import time

start=time.time()
f = open("filename.txt", 'rb')
lines = 0
charcount=0
wordcount=0
#i=10000
buf_size = 1024 * 1024
read_f = f.raw.read

buf = read_f(buf_size)
while buf:
    lines += buf.count(b'\n')   # count newlines, not tabs, to get the line count
    # progress printing (disabled):
    # while lines/i > 1:
    #     print(i)
    #     i += 10000
    # note: strip() at buffer edges and words split across buffers make the
    # character and word counts slightly approximate
    charcount += len(buf.strip())
    wordcount += len(buf.strip().split())
    buf = read_f(buf_size)

f.close()

end=time.time()

print(end-start)
print(lines)
print(charcount/lines)
print(wordcount/lines)
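
Since the chunked version only keeps totals and is slightly approximate at buffer boundaries, a simple way to sanity-check it is an exact per-line pass over a smaller sample and a comparison of the numbers. A minimal sketch, where sample.txt is a hypothetical truncated copy of the real file:

# exact per-line counts for cross-checking the buffered totals
# (sample.txt is a hypothetical smaller sample of filename.txt)
lines = 0
wordcount = 0
charcount = 0
with open("sample.txt", "r", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        lines += 1
        wordcount += len(line.split())
        charcount += len(line)
print(lines, wordcount / lines, charcount / lines)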