Python multiprocessing slower than plain pandas

Date: 2019-05-22 15:45:56

Tags: python pandas multiprocessing dask

I have a 1.4 GB file (about 20 million rows) to read and process. As a first step, I just want to apply a filter to it, and I would like to use multiprocessing to speed the algorithm up.

For this I am using pandas on Linux, and I have 32 GB of RAM...

The problem is that my multiprocessing version of the algorithm is slower than the plain one, and I don't understand why; I think I am missing something!
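To localize the cost, a quick baseline (my addition, not part of the original post) is to time the chunked read by itself, with no filtering at all, using the same `filenames` and `chunk_size` variables as below:

%%time
import pandas as pd

# baseline sketch: iterate over the chunks without doing any work,
# so the wall time here is essentially just CSV parsing
for chunk in pd.read_csv(filenames['my_file'], chunksize=chunk_size):
    pass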

  • Iterative algorithm:
%%time
import numpy as np
import pandas as pd

# read the CSV lazily, one DataFrame of chunk_size rows at a time
my_data = pd.read_csv(filenames['my_file'], chunksize=chunk_size)

res = np.array([])
for chunk in my_data:
    # keep the matching rows; np.append copies the whole of res each time
    res = np.append(res, chunk[(chunk['field1'] == 'field1') |
                               (chunk['field2'] == 'field2')])
CPU times: user 1min 12s, sys: 2.27 s, total: 1min 14s
Wall time: 1min 14s
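As an aside (my addition, not from the original post): np.append reallocates and copies res on every chunk, and it also flattens each filtered DataFrame into a 1-D array. The usual pandas pattern is to collect the filtered chunks in a list and concatenate once at the end, sketched here with the same assumed variables:

import pandas as pd

# sketch: accumulate filtered chunks in a list, concatenate once
parts = []
for chunk in pd.read_csv(filenames['my_file'], chunksize=chunk_size):
    parts.append(chunk[(chunk['field1'] == 'field1') |
                       (chunk['field2'] == 'field2')])
res = pd.concat(parts, ignore_index=True)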
  • Python multiprocessing algorithm:
import multiprocessing as mp

import numpy as np
import pandas as pd

def function(df):
    # keep only the rows where field1 or field2 matches
    return df[(df['field1'] == 'field1') |
              (df['field2'] == 'field2')]
%%time
my_data = pd.read_csv(filenames['my_file'], chunksize=chunk_size)

pool = mp.Pool(mp.cpu_count())
funclist = []
res = np.array([])

for chunk in my_data:
    # submit each chunk; the raw chunk is pickled over to a worker process
    f = pool.apply_async(function, [chunk])
    funclist.append(f)

for f in funclist:
    res = np.append(res, f.get(timeout=10))  # timeout in 10 seconds

pool.close()
pool.join()
CPU times: user 1min 51s, sys: 6.65 s, total: 1min 58s
Wall time: 1min 54s
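Editor's note, not part of the original question: with apply_async, every raw chunk is pickled from the parent process to a worker, and the filtered frame is pickled back, so for a filter this cheap the inter-process serialization can outweigh any parallel speedup. Since dask is already in the tags, here is a minimal sketch of the same filter with dask.dataframe (assuming the same filenames dict and column names), which partitions the CSV itself and runs the read and the filter inside the workers:

import dask.dataframe as dd

# sketch (my addition): dask splits the CSV into partitions and applies
# the filter per partition in parallel; compute() assembles a pandas result
ddf = dd.read_csv(filenames['my_file'])
res = ddf[(ddf['field1'] == 'field1') |
          (ddf['field2'] == 'field2')].compute()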

Thanks!

0 Answers
