Dask - read a huge CSV and write it out to 255 different CSV files

Time: 2019-03-15 13:06:36

Tags: python csv dask

I am using Dask to read a CSV file of about 2 GB. I want to write each of its rows into one of 255 CSV files according to the hash function shown below.

My naive solution:

from dask import dataframe as dd

if __name__ == '__main__':
    df = dd.read_csv('train.csv', header=None, dtype='str')
    df = df.fillna('')  # fillna needs an explicit fill value
    for _, line in df.iterrows():
        number = hash(line[2]) % 256
        with open("{}.csv".format(number), 'a+') as f:
            f.write(', '.join(line) + '\n')  # terminate each row

This approach takes about 15 minutes. Is there any way to do it faster?

1 Answer:

Answer 0: (score: 2)

Since your procedure is dominated by IO, it is very unlikely that Dask would do anything but add overhead in this case, unless your hash function is really really slow. I assume that is not the case.
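As a baseline, here is a minimal sketch (assuming the ~2 GB train.csv fits comfortably in memory) of reading the file with plain pandas instead of Dask:

import pandas as pd

# Read everything in one pass; for an IO-bound job this avoids Dask's
# scheduling overhead entirely.
df = pd.read_csv('train.csv', header=None, dtype=str)
df = df.fillna('')  # same fill as the question's code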

@zwer's solution would look something like:

# hash(...) % 256 produces values 0-255, so 256 output files are needed
files = [open("{}.csv".format(number), 'a+') for number in range(256)]
for _, line in df.iterrows():
    number = hash(line[2]) % 256
    files[number].write(', '.join(line) + '\n')
for f in files:
    f.close()
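A small variation (not part of the original answer): contextlib.ExitStack can keep all 256 handles open for the duration of the loop and close them automatically, even if an exception is raised.

import contextlib

with contextlib.ExitStack() as stack:
    # every handle is closed when the with-block exits
    files = [stack.enter_context(open("{}.csv".format(n), 'a+')) for n in range(256)]
    for _, line in df.iterrows():
        files[hash(line[2]) % 256].write(', '.join(line) + '\n')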

However, your data appears to fit in memory, so you may see much better performance with:

# group by hash key (assumes an in-memory pandas DataFrame, not a Dask one)
for number, group in df.groupby(df.iloc[:, 2].map(hash) % 256):
    # header/index suppressed so the output matches the row-by-row version
    group.to_csv("{}.csv".format(number), header=False, index=False)

because each file is written in one continuous pass rather than by jumping between many of them. Depending on your IO device and buffering, the difference can range from negligible to huge.
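If you stay with the open-handles approach, the buffering remark above points at one knob worth trying: the buffer size passed to open. A hypothetical tweak, not something from the original answer:

# give each handle a 1 MiB write buffer so many small, scattered writes
# are flushed to disk far less often
files = [open("{}.csv".format(n), 'a+', buffering=1024 * 1024)
         for n in range(256)]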