Question

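I need to process two large files (> 1 billion lines) and split each of them into small files based on information in specific lines of one of the files. The files record high-throughput sequencing data in blocks (we call them sequencing reads), and each read consists of 4 lines (name, sequence, n, quality). The records are in the same order in both files.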

To do

Split file1.fq and file2.fq according to the id field of the reads in file2.fq.

The two files look like this:

$ head -n 4 file1.fq
@name1_1
ACTGAAGCGCTACGTCAT
+
A#AAFJJJJJJJJFJFFF

$ head -n 4 file2.fq
@name1_2
TCTCCACCAACAACAGTG
+
FJJFJJJJJJJJJJJAJJ

I wrote the following Python function to do this job:

def p7_bc_demx_pe(fn1, fn2, id_dict):
    """Demultiplex PE reads, by p7 index and barcode"""
    # prepare a pair of writers (mate 1, mate 2) for each small file
    fn_writer = {}
    for i in id_dict:
        fn_writer[i] = [open(id_dict[i] + '.1.fq', 'wt'),
                        open(id_dict[i] + '.2.fq', 'wt')]
    # go through each record in the two files, 4 lines at a time
    with open(fn1, 'rt') as f1, open(fn2, 'rt') as f2:
        while True:
            try:
                s1 = [next(f1), next(f1), next(f1), next(f1)]
                s2 = [next(f2), next(f2), next(f2), next(f2)]
                tag = func(s2)  # func (defined elsewhere) classifies the record
                fn_writer[tag][0].write(''.join(s1))
                fn_writer[tag][1].write(''.join(s2))
            except StopIteration:
                break
    # close writers
    for tag in fn_writer:
        fn_writer[tag][0].close()
        fn_writer[tag][1].close()

Question

Is there any way to speed up this process? (The function above is too slow.)

How about splitting the large files into chunks with a specific number of lines (for example with f.seek()) and running the process in parallel on multiple cores?

EDIT-1

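There are 500 million reads in total in each file (~180 GB in size). The bottleneck is reading and writing files. The following is my current solution (it works, but it is definitely not the best approach):

I first split the big files into smaller files using a shell command (takes about 3 hours).

Then I apply the function to the 8 small files in parallel (takes about 1 hour).

Finally, I merge the results (takes about 2 hours).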
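The three steps above could perhaps be collapsed so that the inputs are never physically split. The following is only a sketch under several assumptions, not a measured improvement: every read is exactly 4 lines, both files contain the same number of reads in the same order, and `classify` is a hypothetical stand-in for the `func(s2)` classifier. The names `chunk_offsets`, `demux_chunk`, `merge_tag` and `demux_parallel` are made up for illustration. Each file is scanned once to record the byte offset at which every chunk of reads starts, then each worker seeks to its own offsets and writes its own per-chunk output files, which are concatenated at the end.

```python
import os
import shutil
from itertools import islice
from multiprocessing import Pool

LINES_PER_READ = 4  # assumption: every read is exactly 4 lines


def chunk_offsets(path, reads_per_chunk):
    """Scan the file once; return the byte offset where each chunk starts."""
    lines_per_chunk = LINES_PER_READ * reads_per_chunk
    offsets, pos = [], 0
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh):
            if lineno % lines_per_chunk == 0:
                offsets.append(pos)
            pos += len(line)
    return offsets


def classify(read2):
    """Hypothetical classifier: tag = first two bases of the sequence line."""
    return read2[1][:2].decode()


def demux_chunk(task):
    """Demultiplex one chunk; each worker writes only its own output files."""
    fn1, fn2, off1, off2, reads_per_chunk, chunk_id, out_dir = task
    n_lines = LINES_PER_READ * reads_per_chunk
    writers = {}
    with open(fn1, "rb") as f1, open(fn2, "rb") as f2:
        f1.seek(off1)
        f2.seek(off2)
        lines1, lines2 = islice(f1, n_lines), islice(f2, n_lines)
        while True:
            s1 = list(islice(lines1, LINES_PER_READ))
            s2 = list(islice(lines2, LINES_PER_READ))
            if len(s1) < LINES_PER_READ or len(s2) < LINES_PER_READ:
                break  # end of chunk or end of file
            tag = classify(s2)
            if tag not in writers:
                base = os.path.join(out_dir, f"{tag}.chunk{chunk_id}")
                writers[tag] = (open(base + ".1.fq", "wb"),
                                open(base + ".2.fq", "wb"))
            writers[tag][0].writelines(s1)
            writers[tag][1].writelines(s2)
    for w1, w2 in writers.values():
        w1.close()
        w2.close()
    return sorted(writers)  # tags seen in this chunk


def merge_tag(out_dir, tag, n_chunks, mate):
    """Concatenate the per-chunk files of one tag and mate in chunk order."""
    target = os.path.join(out_dir, f"{tag}.{mate}.fq")
    with open(target, "wb") as out:
        for i in range(n_chunks):
            part = os.path.join(out_dir, f"{tag}.chunk{i}.{mate}.fq")
            if os.path.exists(part):
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)
    return target


def demux_parallel(fn1, fn2, out_dir, reads_per_chunk=1_000_000, workers=8):
    """Index both files, then demultiplex the chunks in a process pool."""
    offs1 = chunk_offsets(fn1, reads_per_chunk)
    offs2 = chunk_offsets(fn2, reads_per_chunk)
    tasks = [(fn1, fn2, o1, o2, reads_per_chunk, i, out_dir)
             for i, (o1, o2) in enumerate(zip(offs1, offs2))]
    with Pool(workers) as pool:
        return pool.map(demux_chunk, tasks)
```

A call such as `demux_parallel('file1.fq', 'file2.fq', 'out')` followed by `merge_tag` for each tag and mate would replace the split, process and merge steps. The index pass still reads each file once from start to end, and if the disk itself is the limiting factor, extra worker processes may not help, so this would need to be timed on the real data.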

I have not tried PySpark yet. Thanks, @John H.

Answer 1

Take a look at Spark. You can distribute the files across a cluster to speed up processing. There is a Python API: pyspark.

https://spark.apache.org/docs/0.9.0/python-programming-guide.html

This also gives you the advantage of actually executing Java code, which is not subject to the GIL and allows true multithreading.

Speed up reading large files in parallel with Python
