I am writing a script that reads multiple files from the folder "ExtendedReport" and, using DictWriter, writes each file as a dictionary into one CSV file. The program runs in several threads. The folder contains 11 million files in sav format, about 180 GB in total.
The question is: why does the program start writing the CSV quickly and then slow down as the output file grows? And what can I do to keep the speed up?
#%%
import csv, glob, os, pickle, threading, time
from queue import Queue
from threading import Thread

os.chdir('ExtendedReport')
fnames = glob.glob('*sav')  # list of filenames in the folder
os.chdir('../..')
def worker(q, csv_writer_lock, writer):
    while True:
        filename = q.get()
        if filename is None:
            break
        with open(filename, 'rb') as f:
            json_text = pickle.load(f)
        with csv_writer_lock:
            if isinstance(json_text['Data']['Report'], dict):
                writer.writerow(json_text['Data']['Report'])
            elif isinstance(json_text['Data']['Report'], list):
                for report in json_text['Data']['Report']:
                    writer.writerow(report)
#%%
start = time.time()
csv_writer_lock = threading.Lock()
os.chdir('output')
with open('fulldb_ExtendedReport.csv', 'w') as csvfile:
    os.chdir('ExtendedReport')
    fieldnames = list(added_dict['Report'].keys())
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    threads = []
    queue_size = 6
    num_threads = 4
    q = Queue(queue_size)
    for i in range(num_threads):
        th = Thread(target=worker, args=(q, csv_writer_lock, writer))
        threads.append(th)
    for i in range(num_threads):
        threads[i].start()
    for i in range(len(fnames)):
        q.put(fnames[i])
    for i in range(num_threads):
        q.put(None)
    for i in range(num_threads):
        threads[i].join()
    os.chdir('..')
print("Writing complete")
os.chdir('..')
end = time.time()
UPD
I updated the code to measure the average time each worker iteration takes, adding time_list = [] to store the measurements:
def worker(q, csv_writer_lock, writer):
    while True:
        start = time.time()
        filename = q.get()
        if filename is None:
            break
        with open(filename, 'rb') as f:
            json_text = pickle.load(f)
        with csv_writer_lock:
            if isinstance(json_text['Data']['Report'], dict):
                writer.writerow(json_text['Data']['Report'])
            elif isinstance(json_text['Data']['Report'], list):
                for report in json_text['Data']['Report']:
                    writer.writerow(report)
        time_list.append(time.time() - start)
I checked writing 200, 2,000, 20,000, 50,000 and 100,000 files. Each time the average was about 0.002 seconds per file. But now, after at least a million files, it is clearly much slower.
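For completeness, a minimal sketch of how the per-file average can be read off time_list after a run; the exact summary code is not shown above, so this is illustrative only:

# Illustrative summary of time_list; assumes the run has finished and time_list is non-empty
import statistics

print(f"files processed:    {len(time_list)}")
print(f"mean time per file: {statistics.mean(time_list):.4f} s")
print(f"slowest file:       {max(time_list):.4f} s")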
Answer 0 (score: 0)
The problem was solved after changing

json_text = pickle.load(open(filename, 'rb'))

to

with open(filename, 'rb') as f:
    json_text = pickle.load(f)

and restarting Python. Thanks to @martineau for the idea.
UPD
This is not the solution.
UPD2
In the end, I think I found the problem. I run Python on an iMac with a Fusion Drive, and that is why the writing slows down after a while.
A Fusion Drive combines a 128 GB SSD with a 2 TB HDD. While I was testing the code (see "I checked 200, 2000, 20000, 50000, 100000 files to write." above), macOS had moved the files at the head of the queue to the SSD because they were used frequently. That is why reading started out so fast and then slowed down: the later reads came from the HDD.
I measured the time spent reading filename and the time spent in writerow inside worker, and confirmed that after a while the reading time for filename increases significantly (at least 10-12 times), while the time spent waiting for the lock and the time spent in writerow stay roughly the same.
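A minimal sketch of the kind of per-stage timing that supports this observation; the function name worker_timed and the lists read_times, lock_wait_times and write_times are illustrative and not part of the original code:

# Illustrative variant of worker that times each stage separately
def worker_timed(q, csv_writer_lock, writer, read_times, lock_wait_times, write_times):
    while True:
        filename = q.get()
        if filename is None:
            break
        t0 = time.time()
        with open(filename, 'rb') as f:       # disk read + unpickle
            json_text = pickle.load(f)
        t1 = time.time()
        with csv_writer_lock:                 # t2 - t1 = time spent waiting for the lock
            t2 = time.time()
            report = json_text['Data']['Report']
            if isinstance(report, dict):
                writer.writerow(report)
            elif isinstance(report, list):
                for r in report:
                    writer.writerow(r)
            t3 = time.time()
        read_times.append(t1 - t0)        # this is what grows 10-12x over time
        lock_wait_times.append(t2 - t1)   # stays roughly constant
        write_times.append(t3 - t2)       # stays roughly constant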
@martineau, I tried two ways of using a Pool:
def worker_pool(csv_writer_lock, writer, filename):
    start = time.time()
    with open(filename, 'rb') as f:
        json_text = pickle.load(f)
    with csv_writer_lock:
        if isinstance(json_text['Data']['Report'], dict):
            writer.writerow(json_text['Data']['Report'])
        elif isinstance(json_text['Data']['Report'], list):
            for report in json_text['Data']['Report']:
                writer.writerow(report)
    time_list.append(time.time() - start)
#%% by Pool
from concurrent.futures import ThreadPoolExecutor

start = time.time()
time_list = []
csv_writer_lock = threading.Lock()
os.chdir('output')
with open('fulldb_ExtendedReport2.csv', 'w') as csvfile:
    try:
        os.chdir('ExtendedReport')
        fieldnames = list(added_dict['Report'].keys())
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        with ThreadPoolExecutor(max_workers=6) as pool:
            for filename in fnames:
                future = pool.submit(worker_pool, csv_writer_lock, writer, filename)
        os.chdir('..')
    except Exception as e:
        print('Error')
        os.chdir('..')
print("Writing complete")
os.chdir('..')
end = time.time()
#%% by Pool2
import multiprocessing.dummy
from functools import partial

start = time.time()
time_list = []
csv_writer_lock = multiprocessing.dummy.Manager().Lock()
os.chdir('output')
with open('fulldb_ExtendedReport2.csv', 'w') as csvfile:
    os.chdir('ExtendedReport')
    fieldnames = list(added_dict['Report'].keys())
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    pool = multiprocessing.dummy.Pool(10)
    func = partial(worker_pool, csv_writer_lock, writer)
    pool.map(func, fnames)
    pool.close()
    pool.join()
    os.chdir('..')
print("Writing complete")
os.chdir('..')
end = time.time()
But neither of them improved the speed; if anything, it became even slower. Maybe I am doing something wrong... I would appreciate any comments.