Always run a constant number of subprocesses in parallel

Time: 2013-08-08 10:11:38

Tags: python python-3.x parallel-processing subprocess multiprocessing

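I want to use subprocesses to keep 20 instances of a script I have written running in parallel. Say I have a large list of URLs with 100,000 entries; my program should make sure that 20 instances of my script are working on that list at all times. I wanted to code it as follows: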

urllist = [url1, url2, url3, .. , url100000]
i=0
while number_of_subprocesses < 20 and i < 100000:
    subprocess.Popen(['python', 'script.py', urllist[i]])
    i = i + 1

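My script just writes something to a database or a text file. It produces no output and needs no input other than the URL.

My problem is that I could not find anything on how to get the number of active subprocesses. I am a novice programmer, so every hint and suggestion is welcome. I was also wondering how to make the while loop check the condition again once 20 subprocesses have been started. I thought about putting another while loop around it, something like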

while i < 100000:
    while number_of_subprocesses < 20:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1
        if number_of_subprocesses == 20:
            sleep()  # wait for some time before checking again

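Or is there perhaps a better way for the while loop to keep checking the number of subprocesses?

I also considered using the multiprocessing module, but I found it very convenient to simply call script.py with subprocess instead of calling a function with multiprocessing.

Maybe someone can help me and point me in the right direction. Thanks a lot!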

3 answers:

Answer 0: (score: 6)

Taking a different approach from the one above, since it seems a callback cannot be passed as a parameter:

import subprocess
import time

urllist = []  # placeholder: fill in with the URLs to process
NextURLNo = 0
MaxProcesses = 20
MaxUrls = len(urllist)
Processes = []

def StartNew():
   """ Start a new subprocess if there is work to do """
   global NextURLNo
   global Processes

   if NextURLNo < MaxUrls:
      proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
      print("Started to Process %s" % urllist[NextURLNo])
      NextURLNo += 1
      Processes.append(proc)

def CheckRunning():
   """ Check any running processes and start new ones if there are spare slots."""
   global Processes
   global NextURLNo

   for p in range(len(Processes) - 1, -1, -1): # Check the processes in reverse order
      if Processes[p].poll() is not None: # poll() returns None while the process is still running
         del Processes[p] # Remove from list - this is why we needed reverse order

   while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls): # More to do and some spare slots
      StartNew()

if __name__ == "__main__":
   CheckRunning() # This will start the max processes running
   while (len(Processes) > 0): # Something is still going on.
      time.sleep(0.1) # You may wish to change the time for this
      CheckRunning()

   print("Done!")
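
The same polling idea can be written without globals. This is only a minimal sketch, not part of the original answer: the names `prune_finished` and `run_limited` are made up for illustration, and the demo child command is a stand-in for `['python', 'script.py', url]`. It relies on `Popen.poll()` returning None while the child is still running.

```python
import subprocess
import sys
import time

def prune_finished(processes):
    """Return only the Popen objects whose child is still running."""
    # poll() returns None while the child is alive, otherwise its exit code.
    return [p for p in processes if p.poll() is None]

def run_limited(commands, max_processes=20, poll_interval=0.1):
    """Run every command, keeping at most max_processes children alive.

    Returns the highest number of children that were alive at once.
    """
    pending = list(commands)
    running = []
    peak = 0
    while pending or running:
        running = prune_finished(running)
        # Refill the free slots while there is still work left.
        while pending and len(running) < max_processes:
            running.append(subprocess.Popen(pending.pop(0)))
        peak = max(peak, len(running))
        if running:
            time.sleep(poll_interval)
    return peak

if __name__ == "__main__":
    # Hypothetical stand-in for script.py: a child that sleeps briefly.
    demo = [[sys.executable, "-c", "import time; time.sleep(0.2)"] for _ in range(6)]
    print("peak concurrency:", run_limited(demo, max_processes=3))
```

The number of active subprocesses at any moment is simply `len(prune_finished(running))`, which is the value the question was asking for.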

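Answer 1: (score: 1)

Just keep a count as you start them, and use a callback from each subprocess to start a new one if there are any URL list entries left to process.

E.g., assuming your subprocess calls the OnExit method passed to it when it finishes: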

import subprocess
import time

NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0
MaxUrls = 100000

def StartNew():
   """ Start a new subprocess if there is work to do """
   global NextURLNo
   global NoSubProcess

   if NextURLNo < MaxUrls:
      # NOTE: Popen arguments must be strings, so passing the OnExit function
      # like this raises a TypeError; the line only illustrates the idea.
      subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
      print("Started to Process", urllist[NextURLNo])
      NextURLNo += 1
      NoSubProcess += 1

def OnExit():
   global NoSubProcess  # without this, -= raises UnboundLocalError
   NoSubProcess -= 1

if __name__ == "__main__":
   for n in range(MaxProcesses):
      StartNew()
   while (NoSubProcess > 0):
      time.sleep(1)
      if (NextURLNo < MaxUrls):
         for n in range(NoSubProcess,MaxProcesses):
             StartNew()
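
`subprocess.Popen` has no exit-callback parameter, so one way to get the "call me back when the child ends" behaviour is a small watcher thread per child that blocks in `wait()` and then invokes the callback. The following is a hedged sketch of that workaround, not something from the original answer; `popen_with_callback` and `run_with_callbacks` are hypothetical names, and error handling (for example a command that fails to start) is left out.

```python
import subprocess
import sys
import threading

def popen_with_callback(args, on_exit):
    """Start a child process and call on_exit(returncode) once it finishes."""
    proc = subprocess.Popen(args)

    def wait_then_notify():
        proc.wait()                 # blocks this helper thread only
        on_exit(proc.returncode)

    threading.Thread(target=wait_then_notify).start()
    return proc

def run_with_callbacks(commands, max_processes=20):
    """Run all commands with at most max_processes children at a time."""
    pending = list(commands)
    total = len(pending)
    exit_codes = []
    if total == 0:
        return exit_codes
    lock = threading.Lock()
    all_done = threading.Event()

    def start_next():
        # The caller must hold the lock.
        if pending:
            popen_with_callback(pending.pop(0), on_exit)

    def on_exit(returncode):
        with lock:
            exit_codes.append(returncode)
            start_next()            # refill the slot that just became free
            if len(exit_codes) == total:
                all_done.set()

    with lock:
        for _ in range(max_processes):
            start_next()
    all_done.wait()
    return exit_codes

if __name__ == "__main__":
    # Hypothetical stand-in for ['python', 'script.py', url].
    demo = [[sys.executable, "-c", "pass"] for _ in range(5)]
    print(run_with_callbacks(demo, max_processes=2))
```

The lock matters here because the callbacks run in several watcher threads, so the shared list and counter would otherwise be updated concurrently.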

Answer 2: (score: 1)

To keep a constant number of concurrent requests, you can use a thread pool:

#!/usr/bin/env python
from multiprocessing.dummy import Pool

def process_url(url):
    # ... handle a single url
    pass

urllist = [url1, url2, url3, .. , url100000]
for _ in Pool(20).imap_unordered(process_url, urllist):
    pass

To run processes instead of threads, remove .dummy from the import.
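
If you would rather keep calling script.py through subprocess, as the question prefers, the pool worker can simply launch the script and wait for it. A minimal sketch under that assumption: `run_script` and `run_scripts` are illustrative names, and `sys.executable` is used in place of the literal `'python'` so that the child runs under the same interpreter.

```python
import subprocess
import sys
from multiprocessing.dummy import Pool  # thread pool exposing the Pool API

def run_script(url, script="script.py"):
    """Run the script on one URL in a child process; return (url, exit code)."""
    return url, subprocess.call([sys.executable, script, url])

def run_scripts(urls, workers=20, worker=run_script):
    """Process every URL, with at most `workers` child processes at a time."""
    pool = Pool(workers)
    try:
        # Each pool thread blocks in subprocess.call(), so no more than
        # `workers` children exist at any moment.
        return list(pool.imap_unordered(worker, urls))
    finally:
        pool.close()
        pool.join()
```

Calling `run_scripts(urllist)` would then return a list of `(url, exit_code)` pairs in completion order. Threads are enough here because each one spends its time waiting on a child process rather than doing Python work.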