Python / Pandas writing a file line by line :: memory usage

Asked: 2014-11-11 14:33:45

Tags: python pandas

I have a large DataFrame (~9 GB) loaded into memory in Pandas. I am trying to write it out as a text file that follows a given format (Vowpal Wabbit), and I am puzzled by the memory usage and performance. Although the file is large (48 million rows), the initial load into Pandas wasn't bad. Writing out the file takes 6+ hours and then brings my laptop to its knees, consuming nearly all of my RAM (32 GB). Naively, I assumed this operation would work on only one row at a time, so RAM usage should be minimal. Is there a more efficient way to handle this data?

with open("C:\\Users\\Desktop\\DATA\\train_mobile2.vw", "w") as outfile:
    for index, row in train.iterrows():
        # VW label: -1 for no click, 1 for click
        if row['click'] == 0:
            vwline = "-1 "
        else:
            vwline = "1 "
        # one namespaced feature string per row
        vwline += "|a C1_" + str(row['C1']) +\
        " |b banpos_" + str(row['banner_pos']) +\
        " |c siteid_" + str(row['site_id']) +\
        " sitedom_" + str(row['site_domain']) +\
        " sitecat_" + str(row['site_category']) +\
        " |d appid_" + str(row['app_id']) +\
        " app_domain_" + str(row['app_domain']) +\
        " app_cat_" + str(row['app_category']) +\
        " |e d_id_" + str(row['device_id']) +\
        " d_ip_" + str(row['device_ip']) +\
        " d_os_" + str(row['device_os']) +\
        " d_make_" + str(row['device_make']) +\
        " d_mod_" + str(row['device_model']) +\
        " d_type_" + str(row['device_type']) +\
        " d_conn_" + str(row['device_conn_type']) +\
        " d_geo_" + str(row['device_geo_country']) +\
        " |f num_a:" + str(row['C17']) +\
        " numb:" + str(row['C18']) +\
        " numc:" + str(row['C19']) +\
        " numd:" + str(row['C20']) +\
        " nume:" + str(row['C22']) +\
        " numf:" + str(row['C24']) +\
        " |g c21_" + str(row['C21']) +\
        " C23_" + str(row['C23']) +\
        " |h hh_" + str(row['hh']) +\
        " |i doe_" + str(row['doe'])
        outfile.write(vwline + "\n")
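As an aside, a sketch (not the asker's code) of the same loop using itertuples: in recent pandas versions it yields lightweight namedtuples instead of constructing a full Series object for every row, which makes a row-by-row pass like the one above considerably cheaper.

with open("train_mobile2.vw", "w") as outfile:
    for row in train.itertuples():  # namedtuples: no per-row Series allocation
        label = "-1 " if row.click == 0 else "1 "
        vwline = (label + "|a C1_" + str(row.C1) +
                  " |b banpos_" + str(row.banner_pos))  # ... remaining namespaces as in the loop above
        outfile.write(vwline + "\n")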

In response to a user's suggestion,

I coded up the following, but when it runs, the last line raises an error: "unsupported operand type(s) for +: 'numpy.ndarray' and 'str'"

lines_T = np.where(train['click'] == 0, "-1 ", "1 ") +\
        "|a C1_" + train['C1'].astype('str') +\
        " |b banpos_"+ train['banner_pos'].astype('str') +\
....

        "|h hh_"+ train['hh'].astype('str')+\
        " |i doe_"+ train['doe'].astype('str')    #ERROR HERE

lines_T.to_csv("C:\\Users\\Desktop\\DATA\\KAGGLE\\mobile\\train_mobile.vw", mode='a', header=False, index=False)
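The error arises because np.where returns a plain NumPy ndarray, and + between an ndarray of strings and a Python str is not defined. A minimal sketch of a fix, assuming train is the DataFrame above: wrap the np.where result in a pandas Series so the subsequent concatenations become element-wise string operations.

import numpy as np
import pandas as pd

# wrapping the ndarray in a Series (aligned to train's index) makes
# "+ <str>" and "+ <Series>" element-wise string concatenations
labels = pd.Series(np.where(train['click'] == 0, "-1 ", "1 "), index=train.index)
lines_T = labels + "|a C1_" + train['C1'].astype('str')  # ... remaining terms as above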

1 Answer:

Answer 0 (score: 1)

Not sure about the memory usage, but this will certainly be faster:

lines = (pd.Series(np.where(train['click'] == 0, "-1 ", "1 "), index=train.index) +
         "|a C1_" + train['C1'].astype('str') +
         " |b banpos_" + train['banner_pos'].astype('str') +
         ...)  # remaining namespaces as in the question

Then save the lines:

lines.to_csv(outfile, index=False)

If memory becomes an issue, you could also do this in chunks (say, a few million records at a time).
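A minimal sketch of that chunked variant (the chunk size and the build_lines helper are illustrative, not part of the answer): build the line Series for one slice of rows at a time and append it to the output file, so only one chunk's worth of strings is ever in memory.

import numpy as np
import pandas as pd

def build_lines(chunk):
    # same vectorized concatenation as above, applied to one slice of rows
    return (pd.Series(np.where(chunk['click'] == 0, "-1 ", "1 "), index=chunk.index) +
            "|a C1_" + chunk['C1'].astype('str') +
            " |b banpos_" + chunk['banner_pos'].astype('str'))  # ... remaining namespaces as above

chunk_size = 2000000  # illustrative; tune to available RAM
with open("train_mobile.vw", "w") as outfile:
    for start in range(0, len(train), chunk_size):
        lines = build_lines(train.iloc[start:start + chunk_size])
        # header/index suppressed so each element is written as one VW line
        lines.to_csv(outfile, index=False, header=False)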
