Why does the csv writer slow down as the size of the csv increases

Date: 2019-12-16 15:25:59

Tags: python multithreading csv

I am writing a script that reads multiple files from the folder "ExtendedReport" and writes each one as a dictionary into a single csv file using DictWriter. The program runs in multiple threads. The folder contains 11 million files in sav format, and its total size is 180 GB.

The question is: why does the program start writing the csv quickly and then slow down as the file grows? And what can I do to keep the speed up?
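For context, each sav file is assumed to be a pickle of a nested dict whose 'Data' -> 'Report' entry is either a single flat dict or a list of such dicts (that is what the worker below branches on). A minimal sketch of one such file, with made-up field names, could look like this:

import pickle

# Hypothetical example of the assumed per-file structure; the real field
# names come from added_dict['Report'] elsewhere in the script.
example = {
    'Data': {
        'Report': {'id': 1, 'status': 'ok'}   # may also be a list of such dicts
    }
}
with open('example.sav', 'wb') as f:
    pickle.dump(example, f)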

#%%
import csv
import glob
import os
import pickle
import threading
import time
from queue import Queue
from threading import Thread

os.chdir('ExtendedReport')
fnames = glob.glob('*sav')   # list of filenames in the folder
os.chdir('../..')

def worker(q, csv_writer_lock, writer):
    while True:
        filename = q.get()
        if filename is None:
            break

        with open(filename, 'rb') as f:
            json_text = pickle.load(f)

        with csv_writer_lock:
            if isinstance(json_text['Data']['Report'], dict):
                writer.writerow(json_text['Data']['Report'])
            elif isinstance(json_text['Data']['Report'], list):
                for report in json_text['Data']['Report']:
                    writer.writerow(report)



#%%
start = time.time()
csv_writer_lock = threading.Lock()
os.chdir('output')
with open('fulldb_ExtendedReport.csv', 'w') as csvfile:
    os.chdir('ExtendedReport')
    fieldnames = list(added_dict['Report'].keys())   # added_dict is built earlier in the script
    writer = csv.DictWriter(csvfile, fieldnames = fieldnames)

    writer.writeheader()

    threads = []
    queue_size = 6
    num_threads = 4

    q = Queue(queue_size)
    for i in range(num_threads):
        th = Thread(target = worker, args = (q, csv_writer_lock, writer))
        threads.append(th)

    for i in range(num_threads):
        threads[i].start()

    for i in range(len(fnames)):
        q.put(fnames[i])

    for i in range(num_threads):
        q.put(None)

    for i in range(num_threads):
        threads[i].join()

    os.chdir('..')
print("Writing complete")
os.chdir('..')
end = time.time() 

UPD
I updated the code to measure the average worker execution time and added time_list = [] to store the measurements:

def worker(q, csv_writer_lock, writer):
    while True:
        start = time.time()
        filename = q.get()
        if filename is None:
            break

        with open(filename, 'rb') as f:
            json_text = pickle.load(f)

        with csv_writer_lock:
            if isinstance(json_text['Data']['Report'], dict):
                writer.writerow(json_text['Data']['Report'])
            elif isinstance(json_text['Data']['Report'], list):
                for report in json_text['Data']['Report']:
                    writer.writerow(report)

        time_list.append(time.time() - start)

I tested writing 200, 2000, 20000, 50000 and 100000 files. Each time it took about 0.002 s per file. But now, apparently after at least 1 million files, it is much slower.
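A minimal sketch of how the per-file average quoted above could be computed from time_list after the run (assuming the workers have finished appending to it):

# Summarize the timings collected by the workers.
if time_list:
    avg = sum(time_list) / len(time_list)
    print(f"files processed: {len(time_list)}")
    print(f"average time per file: {avg:.4f} s")
    print(f"total worker time: {sum(time_list):.1f} s")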

1 Answer:

Answer 0 (score: 0)

The problem was solved after changing json_text = pickle.load(open(filename, 'rb')) to

with open(filename, 'rb') as f:
    json_text = pickle.load(f)

and restarting python. Thanks to @martineau for the idea.

UPD
This is not the solution.

UPD2
Finally, I think I have figured out the problem. I am running python on an iMac with a Fusion Drive. That is why the writing slows down after a while.

The Fusion Drive has a 128 GB SSD and the remaining 2 TB is an HDD. This means that macOS had moved the files at the head of the queue, the ones processed while I was testing the code (see "I checked 200, 2000, 20000, 50000, 100000 files to write" above), to the SSD, because they were used frequently. That is why reading starts so fast but slows down later, once the reads come from the HDD.

I measured the time to read filename, the time spent waiting for the lock, and the writerow calls in the worker, and confirmed that after a while the time to read filename grows significantly (at least 10-12 times), while the time spent waiting for the lock to be released and the writerow calls themselves stay almost unchanged.
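A minimal sketch of how that per-phase breakdown could be measured inside the worker, assuming read_times, lock_wait_times and write_times are module-level lists (names chosen here for illustration):

import time
import pickle

def worker_timed(q, csv_writer_lock, writer):
    # Same logic as worker(), but timing each phase separately:
    # reading the file, waiting for the lock, and writing the rows.
    while True:
        filename = q.get()
        if filename is None:
            break

        t0 = time.time()
        with open(filename, 'rb') as f:
            json_text = pickle.load(f)
        t1 = time.time()
        read_times.append(t1 - t0)              # disk read + unpickle

        with csv_writer_lock:
            t2 = time.time()
            lock_wait_times.append(t2 - t1)     # time spent waiting for the lock
            report = json_text['Data']['Report']
            if isinstance(report, dict):
                writer.writerow(report)
            elif isinstance(report, list):
                for row in report:
                    writer.writerow(row)
            write_times.append(time.time() - t2)  # time spent in writerow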

@martineau, I tried two approaches with Pool:

def worker_pool(csv_writer_lock, writer, filename):
    start = time.time()

    with open(filename, 'rb') as f:
        json_text = pickle.load(f)

    with csv_writer_lock:
        if isinstance(json_text['Data']['Report'], dict):
            writer.writerow(json_text['Data']['Report'])
        elif isinstance(json_text['Data']['Report'], list):
            for report in json_text['Data']['Report']:
                writer.writerow(report)

    time_list.append(time.time() - start)

#%% by Pool
from concurrent.futures import ThreadPoolExecutor
start = time.time()
time_list = []
csv_writer_lock = threading.Lock()
os.chdir('output')
with open('fulldb_ExtendedReport2.csv', 'w') as csvfile:
    try:
        os.chdir('ExtendedReport')
        fieldnames = list(added_dict['Report'].keys())
        writer = csv.DictWriter(csvfile, fieldnames = fieldnames)

        writer.writeheader()

        with ThreadPoolExecutor(max_workers = 6) as pool:
            for filename in fnames:
                future = pool.submit(worker_pool, csv_writer_lock, writer, filename)

        os.chdir('..')
    except Exception as e:
        print('Error:', e)
        os.chdir('..')
print("Writing complete")
os.chdir('..')
end = time.time()

#%% by Pool2
import multiprocessing.dummy
from functools import partial

start = time.time()
time_list = []
csv_writer_lock = multiprocessing.dummy.Manager().Lock()
os.chdir('output')
with open('fulldb_ExtendedReport2.csv', 'w') as csvfile:
    os.chdir('ExtendedReport')
    fieldnames = list(added_dict['Report'].keys())
    writer = csv.DictWriter(csvfile, fieldnames = fieldnames)

    writer.writeheader()

    pool = multiprocessing.dummy.Pool(10)
    func = partial(worker_pool, csv_writer_lock, writer)
    pool.map(func, fnames)
    pool.close()
    pool.join()

    os.chdir('..')
print("Writing complete")
os.chdir('..')
end = time.time()

But neither of them improved the speed; if anything, it became even slower. Maybe I am doing something wrong... I would appreciate any comments.