Question

I'm trying to use multiprocessing to speed up my data processing. My data consists of 3000 JSON files, and my code looks like this:

import json

def analyse_page(file, arg1, arg2, arg3, arg4):
    with open(file) as f:
        data = json.load(f)
    # Assumes the JSON holds a list; process each item in place
    for i in range(len(data)):
        data[i] = treat_item(data[i], arg1, arg2, arg3, arg4)
    # output_json is the output path (its definition is not shown here)
    with open(output_json, 'w') as f:
        json.dump(data, f)

for file in files:
    analyse_page(file, arg1, arg2, arg3, arg4)

print('done!')

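So the idea is to process the items of each JSON file and then write out the modified JSON. I noticed that my computer uses only 15% of its CPU for the simple for loop, so I decided to use multiprocessing, but I ran into a problem I can't understand. I have tried Process and Pool, both in chunks and all at once, but every time it processes only about a third of the files and then the script stops without any error!

So I ran the code again with if os.path.exists(): continue so that files that were already processed are skipped. Even then, it processes another third of the files and stops. When I start it once more, it processes the remaining third and then prints done!

The analyse_page function takes about 3 s per page, so what is the right way to run the same function with multiprocessing over a long period of time?

Update: what I have tried so far:

Process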

processes = []
for file in files:
    p = multiprocessing.Process(target=analyse_page, args=(file, arg1, arg2, arg3, arg4,))
    processes.append(p)
    p.start()
for process in processes:
    process.join()

Process in batches

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

processes = []
numberOfThreads = 6 #Max is 8
for file in files:
    p = multiprocessing.Process(target=analyse_page, args=(file, arg1, arg2, arg3, arg4,))
    processes.append(p)

for i in chunks(processes,numberOfThreads):
    for j in i:
        j.start()
    for j in i:
        j.join()

Pool

pool = multiprocessing.Pool(6)
for file in files:
    pool.map(analyse_page, (file, arg1, arg2, arg3, arg4,))
pool.close()

Answer 1

To handle multiprocessing easily, you can use the concurrent.futures module.

Python Documentation: Concurrent Futures
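
As a rough illustration of that suggestion, here is a minimal sketch that runs the question's analyse_page through a ProcessPoolExecutor. Several things in it are assumptions and not taken from the question: treat_item is a placeholder, the out_dir parameter replaces the undefined output_json, run_all and its executor_cls parameter are hypothetical names, and the "input" and "output" directories are made up. Collecting each future's result surfaces exceptions raised in workers, which may help explain a script that appears to stop without an error.

```python
import json
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def treat_item(item, arg1, arg2, arg3, arg4):
    # Placeholder: the real treat_item is not shown in the question.
    return item


def analyse_page(file, out_dir, arg1, arg2, arg3, arg4):
    """Process one JSON file (assumed to hold a list) and write the result."""
    with open(file) as f:
        data = json.load(f)
    data = [treat_item(item, arg1, arg2, arg3, arg4) for item in data]
    # Assumption: the output keeps the input file name inside out_dir.
    out_path = Path(out_dir) / Path(file).name
    with open(out_path, "w") as f:
        json.dump(data, f)
    return str(out_path)


def run_all(files, out_dir, arg1, arg2, arg3, arg4,
            max_workers=6, executor_cls=ProcessPoolExecutor):
    """Submit one task per file; return (written paths, {file: exception})."""
    done, failed = [], {}
    with executor_cls(max_workers=max_workers) as executor:
        futures = {
            executor.submit(analyse_page, f, out_dir, arg1, arg2, arg3, arg4): f
            for f in files
        }
        for future in as_completed(futures):
            file = futures[future]
            try:
                # result() re-raises any exception from the worker.
                done.append(future.result())
            except Exception as exc:
                failed[file] = exc
    return done, failed


if __name__ == "__main__":
    # Hypothetical directory names, used only for illustration.
    Path("output").mkdir(exist_ok=True)
    files = sorted(Path("input").glob("*.json"))
    done, failed = run_all(files, "output", 1, 2, 3, 4)
    print(f"{len(done)} files written, {len(failed)} failed")
    for file, exc in failed.items():
        print(f"{file}: {exc!r}")
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module. The executor keeps at most max_workers processes alive and reuses them, unlike starting one Process per file.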

Before I explain the individual aspects, here is a great, simple video tutorial with example code (easy to adapt):

YouTube: Tutorial Multiprocessing

For handling many tasks with multiple processors or threads, I suggest using the queue module.

Python Documentation: Queue

from queue import Queue

#Create Queue object
q = Queue()

#Put item to queue
q.put("my value")

#Get and process each item in queue and remove it
while not q.empty():
    myValue = q.get()
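
The loop above drains the queue from a single consumer. With several consumers, checking q.empty() before q.get() is not reliable, because another worker may take the last item in between. A common alternative is to end each worker with a sentinel value. The following is only a sketch with worker threads. The names worker, process_all and handle are hypothetical. Note that queue.Queue is meant for threads; for separate processes the analogous class is multiprocessing.Queue.

```python
import queue
import threading


def worker(tasks, results, handle):
    """Pull items until a None sentinel arrives; store handle(item) in results."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work (so None cannot be a real item)
            return
        results.put(handle(item))


def process_all(items, handle, num_workers=4):
    """Apply handle to every item using num_workers threads; order is not kept."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, handle))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # All producers have finished, so draining with empty() is safe here.
    out = []
    while not results.empty():
        out.append(results.get())
    return out


if __name__ == "__main__":
    print(sorted(process_all(["my value", "other value"], str.upper)))
```

Threads mainly help with I/O-bound work; for CPU-bound work such as the question's treat_item, a process-based approach is usually the better fit.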
