Question

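What is the fastest way to concatenate multiple files column-wise (in Python)?

Suppose I have two files, each with 1,000,000,000 lines and ~200 UTF-8 characters per line.

Method 1: Cheating with paste

I can concatenate the two files on a Linux system using paste in the shell, and I could cheat by calling it through os.system, i.e.: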

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)

Method 2: Using nested context managers with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                fout.write(line1 + '\t' + line2)

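Method 3: Using fileinput

Does fileinput iterate through the files in parallel? Or does it iterate through each file sequentially, one after the other?

If it is the former, I would assume it would look like this: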

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)

Method 4: Treat them as csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout)
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
          writer.writerow(line1 + '\t' + line2)

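Given the data size, which would be the fastest?

Why would one choose one over the other? Would I lose or add information?

For each method, how would I choose a delimiter other than , or \t?

Are there other ways of achieving the same column-wise concatenation? Are they as fast?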

Answer 1

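Of the four methods, I would pick the second. But you have to take care of small details in the implementation. (With a few improvements it takes 0.002 seconds, whereas the original implementation takes about 6 seconds. The file I was working with had 1M rows, but there should not be much difference if the file is 1K times bigger, since we use almost no memory.)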

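Changes from the original implementation:

Use iterators where possible; otherwise memory consumption suffers and you have to handle the whole file at once. (This matters mainly in Python 2: use itertools.izip instead of zip.)
When concatenating strings, use "{}{}".format() or similar; otherwise a new string instance is generated at each step.
There is no need to write line by line inside the for loop. You can pass an iterator to write.
Small buffers are very interesting, but with iterators the difference is very small. If we try to fetch all the data at once (for example, with f1.readlines(1024 * 1000)), it is much slower.

Example: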

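from itertools import izip  # Python 2; in Python 3 use the built-in zip

def concat_iter(file1, file2, output):
    with open(output, 'w', 1024) as fo, \
        open(file1, 'r') as f1, \
        open(file2, 'r') as f2:
        # Caution: readlines(1024) treats 1024 as a size hint in bytes, so it
        # returns only the first lines up to roughly that size (possibly rounded
        # up to an internal buffer size), not the whole file.
        fo.write("".join("{}\t{}".format(l1, l2)
           for l1, l2 in izip(f1.readlines(1024),
                              f2.readlines(1024))))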

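Profiler output for the original solution.

We see that the biggest costs are in write and zip (mainly because iterators are not used, so the whole file has to be handled and processed in memory).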

~/personal/python-algorithms/files$ python -m cProfile sol_original.py 
10000006 function calls in 5.208 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
    1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    **9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}**
    3    0.000    0.000    0.000    0.000 {open}
    1    1.072    1.072    1.072    1.072 {zip}

Profiler output for the improved version:

~/personal/python-algorithms/files$ python -m cProfile sol1.py 
     3731 function calls in 0.002 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
    1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
 1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
    1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
    2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
    **1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}**
    3    0.000    0.000    0.000    0.000 {open}

It is even faster in Python 3, because iterators are built in and we don't need to import any library.

~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py 
843 function calls (842 primitive calls) in 0.001 seconds
[...]
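
The sol2.py script profiled above is not shown in the answer. As a rough idea of what a Python 3 version of the streaming approach could look like, here is a minimal sketch. The function name concat_columns and the sep parameter are made up for illustration, and the first file's trailing newline is stripped so that each output row stays on one line (an assumption about the desired output, not something the answer specifies).

```python
def concat_columns(path1, path2, out_path, sep="\t"):
    """Join line i of path1 with line i of path2, separated by sep."""
    with open(path1, "r", encoding="utf-8") as f1, \
         open(path2, "r", encoding="utf-8") as f2, \
         open(out_path, "w", encoding="utf-8") as fout:
        # zip() is lazy in Python 3, so only one pair of lines is held
        # in memory at a time, regardless of the file size.
        fout.writelines(
            "{}{}{}".format(l1.rstrip("\n"), sep, l2)
            for l1, l2 in zip(f1, f2)
        )
```

Passing sep="," or any other string changes the delimiter. Note that zip stops at the end of the shorter file.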

It is also very nice to see that memory consumption and file system accesses confirm what we said before:

$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0


$ /usr/bin/time -v python sol_original.py 
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696

Answer 2

In method 2, you can replace the for loop with writelines by passing a genexp to it, and replace zip with izip from itertools. This should come close to paste or surpass it.

from itertools import izip  # Python 2 only

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(*line) for line in izip(fin1, fin2))

If you don't want to embed \t in the format string, you can use repeat from itertools:

    fout.writelines(b"{}{}{}".format(*line) for line in izip(fin1, repeat(b'\t'), fin2))

If the files are of the same length, you can do away with izip.

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(line, next(fin2)) for line in fin1)
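
The snippets above are Python 2 code: in Python 3, bytes objects have no format method and izip no longer exists. They also rely on both files having the same number of lines, since zip and izip stop at the shorter input. A hedged Python 3 sketch that instead pads the shorter file with empty fields might look like the following. The function name concat_columns_padded is hypothetical, and stripping the line endings of both inputs is an assumption about the desired output.

```python
from itertools import zip_longest


def concat_columns_padded(path1, path2, out_path, sep=b"\t"):
    """Join lines pairwise in binary mode, padding the shorter file."""
    with open(path1, "rb") as f1, \
         open(path2, "rb") as f2, \
         open(out_path, "wb") as fout:
        # zip_longest keeps going until both files are exhausted and
        # substitutes b"" for the lines missing from the shorter one.
        fout.writelines(
            l1.rstrip(b"\r\n") + sep + l2.rstrip(b"\r\n") + b"\n"
            for l1, l2 in zip_longest(f1, f2, fillvalue=b"")
        )
```

Working on bytes avoids decoding and re-encoding the UTF-8 text, which is safe here because the separator and the newline are single ASCII bytes.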

Answer 3

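You can try testing your functions with timeit. This doc may be helpful.

Or use the equivalent magic function %%timeit in a Jupyter notebook. You just need to write %%timeit func(data) and you will get a report on the timing of your function. This paper could help you with it.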
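
As a minimal sketch of the timeit approach, the following times one candidate implementation on two small generated files. The names concat_zip and time_concat, the file names, and the sizes are all placeholders chosen for illustration; the timings on such small inputs say little about billion-line files.

```python
import os
import tempfile
import timeit


def concat_zip(path1, path2, out_path):
    """Candidate under test: pairwise join with a plain for loop."""
    with open(path1) as f1, open(path2) as f2, open(out_path, "w") as fout:
        for l1, l2 in zip(f1, f2):
            fout.write(l1.rstrip("\n") + "\t" + l2)


def time_concat(func, n_lines=10000, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    with tempfile.TemporaryDirectory() as d:
        p1, p2, out = (os.path.join(d, name)
                       for name in ("a.txt", "b.txt", "out.txt"))
        # Generate two input files of n_lines lines, 200 characters each.
        for p in (p1, p2):
            with open(p, "w") as f:
                f.writelines("x" * 200 + "\n" for _ in range(n_lines))
        # number=1 runs func once per measurement; min() reduces noise.
        return min(timeit.repeat(lambda: func(p1, p2, out),
                                 number=1, repeat=repeat))
```

Calling time_concat(concat_zip) and then time_concat with another candidate gives a like-for-like comparison on the same machine.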

Answer 4

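Method #1 is the fastest because it uses native (rather than Python) code to concatenate the files. However, it is definitely cheating.

If you want to cheat, you may also consider writing your own C extension for Python; it may be even faster, depending on your coding skills.

I am afraid Method #4 would not work, since you are concatenating lists with a string. I would go with writer.writerow(line1 + line2) instead. You can use the delimiter parameter of both csv.reader and csv.writer to customize the separator (see https://docs.python.org/2/library/csv.html).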
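
A hedged Python 3 sketch of Method #4 with that fix applied could look like this. The function name concat_csv and the in_delim and out_delim parameters are invented for the example; only csv.reader, csv.writer, and their delimiter and lineterminator options come from the standard library.

```python
import csv


def concat_csv(path1, path2, out_path, in_delim=",", out_delim="\t"):
    """Append the columns of path2 to the columns of path1, row by row."""
    # newline="" is what the csv module documentation recommends for files
    # passed to csv.reader and csv.writer in Python 3.
    with open(path1, newline="", encoding="utf-8") as f1, \
         open(path2, newline="", encoding="utf-8") as f2, \
         open(out_path, "w", newline="", encoding="utf-8") as fout:
        reader1 = csv.reader(f1, delimiter=in_delim)
        reader2 = csv.reader(f2, delimiter=in_delim)
        writer = csv.writer(fout, delimiter=out_delim, lineterminator="\n")
        for row1, row2 in zip(reader1, reader2):
            # Each row is a list of fields, so + concatenates the two lists.
            writer.writerow(row1 + row2)
```

Because the csv module parses and re-quotes fields, lines that contain quote characters or the delimiter may not come out byte-for-byte identical to the input, which is worth keeping in mind when the goal is a plain line-wise join.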
