I am using gensim word2vec to return the most similar texts from a corpus, matched against a query text. For example, here are the relevant lines of code:
model = gensim.models.KeyedVectors.load_word2vec_format('/users/myuser/method_approaches/google_news_requirements/GoogleNews-vectors-negative300.bin.gz', binary=True)
instance = WmdSimilarity(processed_set, model, num_best=10)
Then I have this very simple function, which queries the instance; it is what gets handed to each worker process:
def get_most_similar_for_a_given_text(instance, text, output):
    i = instance[text]
    output.put(i)
Then I have a batch multiprocessing script:
def get_most_similar_for_all_texts_in_set(processed_set, instance):
    output = mp.Queue()
    # Setup a list of processes that we want to run
    processes = [mp.Process(target=get_most_similar_for_a_given_text, args=(instance, text, output)) for text in processed_set]
    num_cores = mp.cpu_count()
    Scaling_factor_batch_jobs = 3
    number_of_jobs = len(processes)
    num_jobs_per_batch = num_cores * Scaling_factor_batch_jobs
    num_of_batches = int(number_of_jobs // num_jobs_per_batch) + 1

    print('\n' + 'Running batches now...')
    for i in tqdm.tqdm(range(num_of_batches)):
        # although the floor/ceilings look like things are getting double counted, for instance with ranges being 0:24,24:48,48.. etc.. this is not the case, for whatever reason it doesn't work like that
        if i < num_of_batches - 1:  # true for all but last one
            floor_job = int(i * num_jobs_per_batch)  # int because otherwise it's a float and mp doesn't like that
            ceil_job = int(floor_job + num_jobs_per_batch)
            # Run processes
            for p in processes[floor_job:ceil_job]:
                p.start()
            for p in processes[floor_job:ceil_job]:
                p.join()
            for p in mp.active_children():
                p.terminate()
            print(floor_job, ceil_job)
        else:  # true on last job, which picks up the missing batches that were lost due to rounding in the num_of_batches/num_jobs_per_batch formulas
            floor_job = int(i * num_jobs_per_batch)
            # Run processes
            for p in processes[floor_job:]:
                p.start()
            for p in processes[floor_job:]:
                p.join()
            for p in mp.active_children():
                p.terminate()
            print(floor_job, len(processes))

    # Get process results from the output queue
    results = [output.get() for p in tqdm.tqdm(processes)]
    np.save('/users/josh.flori/method_approaches/numpy_files/wmd_results_list.npy', results)
    return results
What actually happens when I run it is that it runs batches 1 through 4 fine. Those batches cover texts 0:96 of processed_set, which is what I loop over. But then it gets to the 5th batch, texts 96:120, and it seems to just stop processing without failing, exiting, or crashing. Visually it looks like it is still running, but it isn't: my CPU activity drops back down and the progress bar stops moving.
I visually inspected texts 96:120 from processed_set and nothing about them looks odd. I then ran the get_most_similar_for_a_given_text function on those texts individually, outside the multiprocessing function, and they completed just fine.
Anyway, to reiterate, it always happens on the 5th batch. Does anyone have any insight here? I'm not very familiar with how multiprocessing works.
Thanks again.
Answer 0: (score: 0)
It's probably because you are using a Queue. If the queue fills up, a process will block while trying to put into it. In particular, a child process that has put items on a Queue will not exit until that data has been flushed to the underlying pipe, so calling join() before draining the queue can deadlock. Try testing with a very small processed_set and see whether all the jobs complete. If so, you may want to use a Pipe for collecting the large results.