Memory consumption: cannot allocate memory

Date: 2019-05-13 07:27:11

Tags: python pandas nlp multiprocessing data-science

Cannot allocate memory, and some jobs fail without ever being appended to the jobs pool.

What I did: I obtained BERT embeddings (3072 dimensions) and am now running hierarchical clustering on a multiprocessing pool, but it consumes a huge amount of memory and the jobs fail. The server is allocated 48 GB and cannot allocate any more. What can I do?
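To see why 48 GB disappears quickly, here is a back-of-the-envelope estimate for the embeddings in a single pickle file, assuming float64 storage (the vector count and dimension are taken from the post; float32 would halve the figure):

```python
# Rough memory estimate for one pickle's embeddings.
n_vectors = 62204        # count printed for emb_0.pkl
dim = 3072               # BERT embedding size from the post
bytes_per_float = 8      # float64; float32 would halve this

gb = n_vectors * dim * bytes_per_float / 1024**3
print(round(gb, 2))  # ≈ 1.42 GB for a single file, before any copies
```

Every slice passed to `apply_async` is pickled and copied into a worker process, so with 16 workers the same data can exist in memory several times over.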

The data is in pickle files, each containing a list of lists, e.g. `[[emb], [tokens], [doc embs]]`. The loader prints the file name followed by the row count:

    emb_0.pkl
    function running
    62204

    emb_1.pkl
    function running
    66505

    import pickle
    import numpy as np

    slice_emb = []
    slice_tokens = []
    for each in pickled_filepaths:
        print(each)
        with open('folder' + each, 'rb') as f:
            chunk = pickle.load(f)
        emb, token = get_df_emb(chunk)  # get_df_emb extracts embeddings and tokens from the chunk
        slice_emb.extend(np.array_split(emb, 8))       # slice into 8 chunks
        slice_tokens.extend(np.array_split(token, 8))  # slice into 8 chunks
        print(len(emb))
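The slicing step relies on `np.array_split`, which, unlike `np.split`, tolerates lengths that are not evenly divisible. A minimal illustration with small stand-in data (4-dimensional rows instead of the 3072-dimensional BERT vectors):

```python
import numpy as np

# 20 rows of fake 4-dim embeddings (stand-in for the BERT vectors)
emb = np.arange(80).reshape(20, 4)

# np.array_split handles 20 rows / 8 parts without raising
chunks = np.array_split(emb, 8)
print([c.shape[0] for c in chunks])  # [3, 3, 3, 3, 2, 2, 2, 2]
```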

    import multiprocessing
    from datetime import datetime

    start = datetime.now()

    pool = multiprocessing.Pool(16)
    jobs_pool = []
    for x, y in zip(slice_emb, slice_tokens):
        print(x.shape)
        print(y.shape)
        pool_chunk = pool.apply_async(cluster_function, [x, y])  # cluster each slice
        jobs_pool.append(pool_chunk)

    df_list_pool = []
    for i, j in enumerate(jobs_pool):
        df_list_pool.append(j.get())
        print(df_list_pool[i].shape)

    end = datetime.now()
    print(end - start)
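The submit-then-collect pattern above can be sketched end to end as follows. `multiprocessing.dummy.Pool` (a thread-backed pool with the same API) is used here only so the snippet is self-contained and runnable anywhere, and `cluster_function` is a placeholder for the real hierarchical-clustering step:

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool
import numpy as np

def cluster_function(x, y):
    # placeholder for the real clustering step: just join the two arrays
    return np.column_stack([x, y])

# 8 small slices standing in for the BERT embedding/token slices
slice_emb = np.array_split(np.arange(32).reshape(16, 2), 8)
slice_tokens = np.array_split(np.arange(16).reshape(16, 1), 8)

pool = Pool(4)
jobs_pool = []
for x, y in zip(slice_emb, slice_tokens):
    # apply_async must be called inside the loop, once per slice
    jobs_pool.append(pool.apply_async(cluster_function, [x, y]))

results = [j.get() for j in jobs_pool]  # one result per slice
pool.close()
pool.join()
print(len(results))  # 8
```

Note that each `apply_async` call pickles its arguments into the worker, so with a process-based pool each large slice is copied; keeping slices small (or having workers load their own data from disk) is what keeps peak memory down.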


`df_list_pool` should receive all the data in the format below:


         a_type  B_type  cluster  token
    0      2411       g      1.0  a
    26     9956       g      1.0  b
    27    24323       g      1.0  awq
    28     3460       g      1.0  bw
    226    9732       g      1.0  cp
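If each worker returns a DataFrame with this schema, the per-chunk results can be combined into a single frame with `pd.concat` (a follow-up step, not shown in the post; the sample frames below are made-up stand-ins):

```python
import pandas as pd

# two stand-in chunk results with the schema shown above
df_a = pd.DataFrame({"a_type": [2411, 9956], "B_type": ["g", "g"],
                     "cluster": [1.0, 1.0], "token": ["a", "b"]})
df_b = pd.DataFrame({"a_type": [24323], "B_type": ["g"],
                     "cluster": [1.0], "token": ["awq"]})

combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.shape)  # (3, 4)
```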

0 Answers:

No answers yet.