Python: parallel processing of a list of DataFrames

Date: 2017-01-17 03:42:13

Tags: python multithreading pandas

I have a list of DataFrames. Inside a loop I iterate over this list, clean each DataFrame, append it to another list, and return that list:

allDfs = []

def processDfs(self):
    # listOfDfs is the list of DataFrames to clean
    for df in listOfDfs:
        for column_name in need_to_change_column_name:
            ...  # some column name changes
        df.set_index('id', inplace=True)

        # drop any rows containing NaN
        df = df.dropna()
        ...

        df['cost'] = df['cost'].astype('float64')

        allDfs.append(df)

    return allDfs

How can I distribute the processing of each DataFrame in listOfDfs across multiple threads, then collect the results and return the list of processed DataFrames?

1 Answer:

Answer 0 (score: 1)

Use the multiprocessing module:

from multiprocessing import Pool

# enter the desired number of processes here
NUM_PROCS = 8

def process_single_df(df):
    """
    Function that processes a single df.
    """
    for column_name in need_to_change_column_name:
        # some column name changes
        ...

    df.set_index('id', inplace=True)

    # drop any rows containing NaN
    df = df.dropna()
    ...

    df['cost'] = df['cost'].astype('float64')

    return df

pool = Pool(processes=NUM_PROCS)

allDfs = pool.map(process_single_df, listOfDfs)

The call to pool.map is blocking: it waits until all the worker processes have finished before the program continues.

If you don't need allDfs right away (i.e., you are happy to keep computing other things while the parallel processing runs), replace the last line with pool.map_async:
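To see the blocking, order-preserving behavior of pool.map in isolation, here is a minimal runnable sketch; square is a toy stand-in for the per-DataFrame cleaning function, not part of the original code:

```python
from multiprocessing import Pool

def square(x):
    # stand-in for the per-DataFrame cleaning step
    return x * x

pool = Pool(processes=4)
# pool.map blocks until every task has finished, and the
# results come back in the same order as the inputs
results = pool.map(square, [1, 2, 3, 4])
pool.close()
pool.join()
print(results)  # [1, 4, 9, 16]
```

Note that results lines up index-for-index with the input list, so the ordering of the cleaned DataFrames is preserved.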

# get async result instead (non-blocking)
async_result = pool.map_async(process_single_df, listOfDfs)
# do other stuff
...
# ok, now I need allDfs so will call async_result.get
allDfs = async_result.get()
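Since the question asks about threads: concurrent.futures.ThreadPoolExecutor offers a near drop-in thread-based alternative, though for CPU-bound pandas transformations the GIL usually makes processes the better choice. A minimal sketch, where clean and the numeric inputs are toy placeholders for the real cleaning function and DataFrames:

```python
from concurrent.futures import ThreadPoolExecutor

def clean(df):
    # placeholder for the real per-DataFrame cleaning logic
    return df * 2

dfs = [1, 2, 3]  # stand-ins for the DataFrames in listOfDfs

# executor.map mirrors pool.map: it distributes the work
# across threads and yields results in input order
with ThreadPoolExecutor(max_workers=4) as executor:
    allDfs = list(executor.map(clean, dfs))
```

The with block joins the threads on exit, so no explicit close/join calls are needed.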