Python - a separate thread for writing data in parallel makes my code slower - but why?

Asked: 2019-02-06 14:20:05

Tags: python multithreading parallel-processing multiprocessing python-multithreading

My code flow looks like this:

import pandas as pd
import threading
import helpers

upload_thread = None  # no upload running yet on the first iteration

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        # wait for the previous upload to finish, if one is still running
        if isinstance(upload_thread, threading.Thread):
            if upload_thread.is_alive():
                print('waiting for the last upload op to finish')
                upload_thread.join()

        # start the upload in another thread, so the loop can continue on the next chunk
        upload_thread = threading.Thread(target=helpers.uploading, kwargs=kwargs)
        upload_thread.start()

It works, but here is the problem: running it with the thread makes it slower!

My idea for the code flow was:

  1. Process a chunk of data

  2. When that is done, upload it in the background

  3. While it uploads, advance the loop to the next step, i.e. process the next chunk

In theory that sounds good, but after a lot of experimenting and timing, I am convinced the threading is slowing the code down.

I am sure I messed something up - please help me figure out what is wrong.

Also, the function `helpers.uploading` returns results that are important to me. How do I get them back? Ideally, I need to append each iteration's result to a results list. Without threads it would look like this:

import pandas as pd
import helpers

results = []

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        result = helpers.uploading(**kwargs)
        results.append(result)
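For reference, here is a minimal sketch of one way to keep the upload in the background while still collecting return values, using `concurrent.futures` instead of a bare `threading.Thread` (a `Thread`'s `target` return value is discarded, but a `Future` keeps it). The functions `prepare_df` and `uploading` below are stand-ins for the real `prepare_df` and `helpers.uploading` from the question, so the example runs on its own:

```python
from concurrent.futures import ThreadPoolExecutor

def prepare_df(chunk):   # stand-in for the real preprocessing step
    return [x * 2 for x in chunk]

def uploading(data):     # stand-in for helpers.uploading
    return sum(data)     # pretend the "important result" is a sum

chunks = [[1, 2], [3, 4], [5, 6]]

# max_workers=1 mirrors the original pattern: at most one upload in
# flight at a time, while the main thread is free to prepare the next chunk.
with ThreadPoolExecutor(max_workers=1) as pool:
    futures = [pool.submit(uploading, prepare_df(c)) for c in chunks]
    results = [f.result() for f in futures]  # .result() blocks and returns the value

print(results)  # [6, 14, 22]
```

Each `submit` call returns a `Future`; calling `.result()` on it waits for that upload to finish and hands back whatever the function returned, in submission order.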

Thanks!
