Python / Pandas writing a file line by line :: memory usage

Asked: 2014-11-11 14:33:45

Tags: python pandas

I have a large DataFrame (~9 GB) loaded into memory in Pandas. I am trying to write it out as a text file that follows a given format (Vowpal Wabbit), and I am puzzled by the memory usage and performance. Although the file is large (48 million rows), the initial load into Pandas wasn't bad. Writing out the file takes 6+ hours and then brings my laptop to its knees, consuming nearly all of my RAM (32 GB). Naively, I assumed this operation would work on only one row at a time, so RAM usage should be minimal. Is there a more efficient way to handle this data?

with open("C:\\Users\\Desktop\\DATA\\train_mobile2.vw", "w") as outfile:
    for index, row in train.iterrows():
        # VW label: -1 for no click, 1 for click
        if row['click'] == 0:
            vwline = "-1 "
        else:
            vwline = "1 "
        # one namespaced feature string per row
        vwline += "|a C1_" + str(row['C1']) +\
        " |b banpos_" + str(row['banner_pos']) +\
        " |c siteid_" + str(row['site_id']) +\
        " sitedom_" + str(row['site_domain']) +\
        " sitecat_" + str(row['site_category']) +\
        " |d appid_" + str(row['app_id']) +\
        " app_domain_" + str(row['app_domain']) +\
        " app_cat_" + str(row['app_category']) +\
        " |e d_id_" + str(row['device_id']) +\
        " d_ip_" + str(row['device_ip']) +\
        " d_os_" + str(row['device_os']) +\
        " d_make_" + str(row['device_make']) +\
        " d_mod_" + str(row['device_model']) +\
        " d_type_" + str(row['device_type']) +\
        " d_conn_" + str(row['device_conn_type']) +\
        " d_geo_" + str(row['device_geo_country']) +\
        " |f num_a:" + str(row['C17']) +\
        " numb:" + str(row['C18']) +\
        " numc:" + str(row['C19']) +\
        " numd:" + str(row['C20']) +\
        " nume:" + str(row['C22']) +\
        " numf:" + str(row['C24']) +\
        " |g c21_" + str(row['C21']) +\
        " C23_" + str(row['C23']) +\
        " |h hh_" + str(row['hh']) +\
        " |i doe_" + str(row['doe'])
        outfile.write(vwline + "\n")
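As an aside, a sketch (not the asker's code) of the same loop using itertuples: in recent pandas versions it yields lightweight namedtuples instead of constructing a full Series object for every row, which makes a row-by-row pass like the one above considerably cheaper.

with open("train_mobile2.vw", "w") as outfile:
    for row in train.itertuples():  # namedtuples: no per-row Series allocation
        label = "-1 " if row.click == 0 else "1 "
        vwline = (label + "|a C1_" + str(row.C1) +
                  " |b banpos_" + str(row.banner_pos))  # ... remaining namespaces as in the loop above
        outfile.write(vwline + "\n")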

In response to a user's suggestion,

I coded up the following, but when it runs, the last line raises an error: "unsupported operand type(s) for +: 'numpy.ndarray' and 'str'"

lines_T = np.where(train['click'] == 0, "-1 ", "1 ") +\
        "|a C1_" + train['C1'].astype('str') +\
        " |b banpos_"+ train['banner_pos'].astype('str') +\
....

        "|h hh_"+ train['hh'].astype('str')+\
        " |i doe_"+ train['doe'].astype('str')    #ERROR HERE

lines_T.to_csv("C:\\Users\\Desktop\\DATA\\KAGGLE\\mobile\\train_mobile.vw", mode='a', header=False, index=False)
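The error arises because np.where returns a plain NumPy ndarray, and + between an ndarray of strings and a Python str is not defined. A minimal sketch of a fix, assuming train is the DataFrame above: wrap the np.where result in a pandas Series so the subsequent concatenations become element-wise string operations.

import numpy as np
import pandas as pd

# wrapping the ndarray in a Series (aligned to train's index) makes
# "+ <str>" and "+ <Series>" element-wise string concatenations
labels = pd.Series(np.where(train['click'] == 0, "-1 ", "1 "), index=train.index)
lines_T = labels + "|a C1_" + train['C1'].astype('str')  # ... remaining terms as above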

1 Answer:

Answer 0 (score: 1)

Not sure about the memory usage, but this will certainly be faster:

lines = (pd.Series(np.where(train['click'] == 0, "-1 ", "1 "), index=train.index) +
         "|a C1_" + train['C1'].astype('str') +
         " |b banpos_" + train['banner_pos'].astype('str') +
         ...)  # remaining namespaces as in the question

Then save the lines:

lines.to_csv(outfile, index=False)

If memory becomes an issue, you could also do this in chunks (say, a few million records at a time).
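A minimal sketch of that chunked variant (the chunk size and the build_lines helper are illustrative, not part of the answer): build the line Series for one slice of rows at a time and append it to the output file, so only one chunk's worth of strings is ever in memory.

import numpy as np
import pandas as pd

def build_lines(chunk):
    # same vectorized concatenation as above, applied to one slice of rows
    return (pd.Series(np.where(chunk['click'] == 0, "-1 ", "1 "), index=chunk.index) +
            "|a C1_" + chunk['C1'].astype('str') +
            " |b banpos_" + chunk['banner_pos'].astype('str'))  # ... remaining namespaces as above

chunk_size = 2000000  # illustrative; tune to available RAM
with open("train_mobile.vw", "w") as outfile:
    for start in range(0, len(train), chunk_size):
        lines = build_lines(train.iloc[start:start + chunk_size])
        # header/index suppressed so each element is written as one VW line
        lines.to_csv(outfile, index=False, header=False)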
