Multiprocessing batch job suddenly stops in Python

Asked: 2018-03-22 12:13:42

Tags: python multiprocessing word2vec

I'm using gensim word2vec to return the texts from a corpus that are most similar to a query text. For example, here are some of the relevant lines of code:

import gensim
from gensim.similarities import WmdSimilarity

model = gensim.models.KeyedVectors.load_word2vec_format('/users/myuser/method_approaches/google_news_requirements/GoogleNews-vectors-negative300.bin.gz', binary=True)
instance = WmdSimilarity(processed_set, model, num_best=10)

I then have this very simple function, which queries the similarity instance when it is handed off to a multiprocessing worker:

def get_most_similar_for_a_given_text(instance, text, output):
    # Query the WmdSimilarity instance and push the result onto the shared queue
    i = instance[text]
    output.put(i)

And then I have a batched multiprocessing script:

import multiprocessing as mp
import numpy as np
import tqdm

def get_most_similar_for_all_texts_in_set(processed_set, instance):
    output = mp.Queue()
    # Setup a list of processes that we want to run
    processes = [mp.Process(target=get_most_similar_for_a_given_text, args=(instance, text, output)) for text in processed_set]
    num_cores = mp.cpu_count()
    Scaling_factor_batch_jobs = 3
    number_of_jobs = len(processes)
    num_jobs_per_batch = num_cores * Scaling_factor_batch_jobs
    num_of_batches = int(number_of_jobs // num_jobs_per_batch) + 1
    print('\n'+'Running batches now...')
    for i in tqdm.tqdm(range(num_of_batches)):
        # although the floor/ceilings look like things are getting double counted, for instance with ranges being 0:24,24:48,48.. etc.. this is not the case, for whatever reason it doesn't work like that
        if i<num_of_batches-1: # true for all but last one
            floor_job = int(i * num_jobs_per_batch) # int because otherwise it's a float and mp doesn't like that
            ceil_job  = int(floor_job + num_jobs_per_batch)
            # Run processes
            for p in processes[floor_job : ceil_job]:
                p.start()
            for p in processes[floor_job : ceil_job]:
                p.join()
            for p in mp.active_children():
                p.terminate()
            print(floor_job,ceil_job)
        else: # true on last job, which picks up the missing batches that were lost due to rounding in the num_of_batches/num_jobs_per_batch formulas
            floor_job = int(i * num_jobs_per_batch)
            # Run processes
            for p in processes[floor_job:]:
                p.start()
            for p in processes[floor_job:]:
                p.join()
            for p in mp.active_children():
                p.terminate()
            print(floor_job,len(processes))
    # Get process results from the output queue
    results = [output.get() for p in tqdm.tqdm(processes)]
    np.save('/users/josh.flori/method_approaches/numpy_files/wmd_results_list.npy', results)
    return results

What actually happens when I run it is that batches 1 through 4 run fine. Those batches cover texts 0:96 of processed_set, which is what I'm looping over. But then it gets to batch 5, texts 96:120, and it just seems to stop processing without failing, exiting, crashing, or doing anything at all. Visually it looks like it is still running, but it isn't, because my CPU activity drops back down and the progress bar stops moving.

I visually inspected texts 96:120 from processed_set and nothing looked unusual. I then ran get_most_similar_for_a_given_text on each of those texts individually, outside the multiprocessing function, and they all completed just fine.

Anyway, to reiterate: it always hangs on the 5th batch. Does anyone have any insight here? I'm not very familiar with how multiprocessing works.

Thanks again.

1 Answer:

Answer 0 (score: 0)

This is probably because you are using a Queue. If the queue fills up, a process will block when it tries to put its result into it. Try testing with a very small processed_set and see whether all jobs complete. If they do, you may want to use a Pipe instead to collect large results.
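If the queue-full theory is right, one common workaround is to drain the queue before joining the workers in each batch, so a child never blocks on put() while the parent is blocked on join(). Below is a minimal sketch of that idea, not the poster's exact code: the names worker and run_batch are made up for illustration, and it assumes each child puts exactly one result.

    import multiprocessing as mp

    def worker(instance, text, output):
        # Same idea as get_most_similar_for_a_given_text: push one result onto the queue.
        output.put(instance[text])

    def run_batch(instance, texts, output):
        """Run one batch of processes, draining the queue *before* joining.

        A child that put()s a large result blocks until the parent reads it,
        so joining first while the queue is still full can deadlock.
        """
        procs = [mp.Process(target=worker, args=(instance, t, output)) for t in texts]
        for p in procs:
            p.start()
        # Pull one result per started process while the children are still running.
        results = [output.get() for _ in procs]
        for p in procs:
            p.join()
        return results

With this ordering the children can finish their put() calls and exit, so join() returns immediately afterwards. The Pipe suggested above is an alternative, as is letting multiprocessing.Pool.map handle result collection for you.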
