Question

I have a Python script that looks something like this:

def test_run():
     global files_dir
     for f1 in os.listdir(files_dir):
          for f2 in os.listdir(files_dir):
               os.system("run program x on f1 and f2")

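What is the best way to run each os.system call on a different processor? Should I use subprocess or a multiprocessing pool?

Note: each run of the program generates an output file.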

Answer 1

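@unutbu's answer is good, but there is a less disruptive way to do it: use a Pool to hand out the tasks. Then you don't have to muck with your own queues. For example,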

import os
NUM_CPUS = None  # defaults to all available

def worker(f1, f2):
    os.system("run program x on f1 and f2")

def test_run(pool):
     filelist = os.listdir(files_dir)
     for f1 in filelist:
          for f2 in filelist:
               pool.apply_async(worker, args=(f1, f2))

if __name__ == "__main__":
     import multiprocessing as mp
     pool = mp.Pool(NUM_CPUS)
     test_run(pool)
     pool.close()
     pool.join()

This looks a lot more like the code you started with. Not that that's necessarily a good thing ;-)
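
The os.system string in worker() is only a placeholder: f1 and f2 are never substituted into it. A minimal sketch of a worker that does pass both file names to an external command might look like this. The PROGRAM list is a hypothetical stand-in (a Python one-liner that prints its two arguments), not the real "program x", and subprocess.run is swapped in for os.system so the names are passed as a list rather than through the shell.

```python
import subprocess
import sys

# Hypothetical stand-in for "program x": a one-liner that prints its two
# arguments. Replace it with the real executable and its flags.
PROGRAM = [sys.executable, "-c", "import sys; print(sys.argv[1], sys.argv[2])"]

def worker(f1, f2):
    """Run the external program on one pair of files and return its exit code."""
    # Passing a list avoids shell quoting problems with unusual file names.
    completed = subprocess.run(PROGRAM + [f1, f2])
    return completed.returncode
```

Because this worker is an ordinary module-level function, it can be handed to pool.apply_async exactly like the one above. Note that os.listdir returns bare names, so the real command probably wants os.path.join(files_dir, name) for each argument.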

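In recent versions of Python 3, Pool objects can also be used as context managers. Note that leaving the with block calls pool.terminate(), which would stop tasks that are still queued, so close() and join() are still needed inside the block:

if __name__ == "__main__":
     import multiprocessing as mp
     with mp.Pool(NUM_CPUS) as pool:
         test_run(pool)
         pool.close()  # no more tasks will be submitted
         pool.join()   # wait for them before __exit__ calls terminate()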

Edit: use concurrent.futures instead

For very simple tasks like this one, Python 3's concurrent.futures can be easier to use. Replace the code above, from test_run() on down, with:

def test_run():
     import concurrent.futures as cf
     filelist = os.listdir(files_dir)
     with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
         for f1 in filelist:
             for f2 in filelist:
                 pp.submit(worker, f1, f2)

if __name__ == "__main__":
     test_run()

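It needs to be fancier if you don't want exceptions in worker processes to vanish silently. This is a potential problem with all parallelism gimmicks. The trouble is that there's usually no good way to raise exceptions in the main program, because they occur in contexts (worker processes) that may have nothing to do with what the main program is doing at the time. One way to get the exceptions (re)raised in the main program is to ask for the results explicitly; for example, change the above to: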

def test_run():
     import concurrent.futures as cf
     filelist = os.listdir(files_dir)
     futures = []
     with cf.ProcessPoolExecutor(NUM_CPUS) as pp:
         for f1 in filelist:
             for f2 in filelist:
                 futures.append(pp.submit(worker, f1, f2))
     for future in cf.as_completed(futures):
         future.result()

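Then, if an exception occurs in a worker process, future.result() will re-raise that exception in the main program when it is applied to the Future object representing the failing inter-process call.

Probably more than you wanted to know ;-)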
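
The loop above stops at the first failure, because the re-raised exception propagates out of test_run(). If you would rather let every pair run and report the failures afterwards, something like the following sketch could work. The failure condition in worker(), the run_all() helper, and its executor_cls parameter are all made up for illustration; the parameter is only there so that a ThreadPoolExecutor can be substituted when trying it out.

```python
import concurrent.futures as cf

def worker(f1, f2):
    # Hypothetical failure condition, only so there is something to catch.
    if f1 == f2:
        raise ValueError(f"refusing to compare {f1} with itself")
    return (f1, f2)

def run_all(pairs, executor_cls=cf.ProcessPoolExecutor, max_workers=None):
    """Run worker on every pair and return (results, failures)."""
    results, failures = [], []
    with executor_cls(max_workers) as pp:
        # Remember which pair each Future belongs to.
        future_to_pair = {pp.submit(worker, f1, f2): (f1, f2) for f1, f2 in pairs}
        for future in cf.as_completed(future_to_pair):
            pair = future_to_pair[future]
            try:
                results.append(future.result())
            except Exception as exc:  # re-raised here, in the main program
                failures.append((pair, exc))
    return results, failures
```

With the default ProcessPoolExecutor, call run_all() from under an if __name__ == "__main__": guard, as in the snippets above, and then print or log the (pair, exception) entries in failures.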

Answer 2

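You can use a mixture of subprocess and multiprocessing. Why both? If you just naively use subprocess, you will spawn as many subprocesses as there are tasks. You could easily have thousands of tasks, and spawning that many subprocesses at once could bring your machine to its knees.

With multiprocessing, however, you can spawn only as many worker processes as your machine has CPUs (mp.cpu_count()). Each worker process then reads tasks (pairs of file names) from a queue and spawns a subprocess. The worker should wait until the subprocess completes before taking another task from the queue.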

import multiprocessing as mp
import itertools as IT
import os
import subprocess

SENTINEL = None
def worker(queue):
    # read items from the queue and spawn subprocesses
    # The for-loop ends when queue.get() returns SENTINEL
    for f1, f2 in iter(queue.get, SENTINEL):
        proc = subprocess.Popen(['prog', f1, f2])
        proc.communicate()

def test_run(files_dir):
    # avoid globals when possible. Pass files_dir as an argument to the function
    # global files_dir  
    queue = mp.Queue()

    # Setup worker processes. The workers will all read from the same queue.
    procs = [mp.Process(target=worker, args=[queue]) for i in range(mp.cpu_count())]
    for p in procs:
        p.start()

    # put items (tasks) in the queue
    files = os.listdir(files_dir)
    for f1, f2 in IT.product(files, repeat=2):
        queue.put((f1, f2))
    # Put sentinels in the queue to signal the worker processes to end    
    for p in procs:    
        queue.put(SENTINEL)

    for p in procs:
        p.join()
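
Two idioms in the code above may be unfamiliar: IT.product(files, repeat=2) yields the same pairs as two nested for-loops over files, and iter(queue.get, SENTINEL) keeps calling queue.get() until it returns the sentinel. Here is a small single-process sketch of both, using queue.Queue in place of mp.Queue so it can be run without starting any workers. The file names and the drain() helper are made up for illustration.

```python
import itertools as IT
import queue

SENTINEL = None

def drain(q):
    """Collect tasks from q until the sentinel is seen."""
    # iter(callable, sentinel) calls q.get() repeatedly and stops as soon
    # as the returned value equals SENTINEL.
    return [task for task in iter(q.get, SENTINEL)]

files = ["a.txt", "b.txt"]  # hypothetical file names
q = queue.Queue()
for f1, f2 in IT.product(files, repeat=2):  # same pairs as two nested loops
    q.put((f1, f2))
q.put(SENTINEL)  # one sentinel per consumer; there is only one here

tasks = drain(q)
```

In the multi-process version each worker needs its own sentinel, which is why test_run() puts one SENTINEL on the queue per process.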

Multithreaded system calls in Python
