Question

I am using pandas.read_csv to open a file named file.dat. file.dat contains several hundred million rows, so its size exceeds my available memory. The file looks like this:

2.069921794968841368e+03 4.998600000000000000e+04
2.069943528235504346e+03 4.998600000000000000e+04
2.070004614137329099e+03 4.998300000000000000e+04
2.070022949424665057e+03 4.998100000000000000e+04
2.070029861936420730e+03 4.998000000000000000e+04
....
.... 
....

The snippet that opens the file is:

file = pd.read_csv("file.dat", 
                     delim_whitespace = True, index_col = None,
                     iterator = True, chunksize = 1000)

I have a function process that iterates over file and performs an analysis:

def process(file, arg):
    output = []
    for chunk in file: # iterate through each chunk of the file 
        val = evaluate(chunk, arg) # do something involving chunk and arg
        output.append(val) # and incorporate this into output
    return output # then return the result

Everything works fine. However, to run process(file, arg) more than once, I have to re-run the file = pd.read_csv snippet. For example, this does not work:

outputs = []
for arg in [arg1, arg2, arg3]:
    outputs.append(process(file, arg))

But this does:

outputs = []
for arg in [arg1, arg2, arg3]:
    file = pd.read_csv("file.dat", 
                         delim_whitespace = True, index_col = None,
                         iterator = True, chunksize = 1000)
    outputs.append(process(file, arg))

The underlying problem is that the iterable produced by pd.read_csv can only be used once. Why is that? Is this expected behavior?

Answer 1

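This is expected behavior, because the TextFileReader object returned by pd.read_csv when the chunksize parameter is specified is an iterator, not an iterable: it can be traversed only once, unlike a container such as a list, which hands out a fresh iterator on every loop.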

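I admit the wording about the object you get back is somewhat confusing. The documentation says that you get an "iterable object". But if you look at the source code in the pandas.io.parsers.py file, you will find that TextFileReader is an iterator, because the class contains a __next__ method.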
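
The distinction can be checked directly. The following is a minimal sketch using only the standard library; the helper name describe is made up for illustration, and a plain list stands in for the data. An iterator is an object that defines __next__ and for which iter(obj) returns the object itself, so the same check could presumably be applied to the reader returned by pd.read_csv.

```python
from collections.abc import Iterable, Iterator

data = [1, 2, 3]   # a list is an iterable: every iter() call starts a fresh pass
it = iter(data)    # an iterator: it keeps its own position and eventually runs out


def describe(obj):
    """Hypothetical helper: (is_iterable, is_iterator, iter_returns_self)."""
    return (isinstance(obj, Iterable), isinstance(obj, Iterator), iter(obj) is obj)


list_info = describe(data)  # a list is iterable but is not its own iterator
iter_info = describe(it)    # an iterator is iterable and returns itself from iter()

first_pass = list(it)   # consumes every element
second_pass = list(it)  # the iterator is exhausted, so nothing is left
```

Here list_info is (True, False, False) and iter_info is (True, True, True); first_pass holds all three values while second_pass is empty, which is the same symptom as the second call to process(file, arg).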

So in your case, file is an iterator, and it is exhausted after a single call to the process function. You can observe a similar effect with numpy.array in this toy example:

import numpy as np


arr1 = np.array([1, 2, 3])
arr2 = iter(arr1)


def process(file, arg):
    output = []
    for chunk in file:  # iterate through each chunk of the file
        val = chunk ** arg  # do something involving chunk and arg
        output.append(val)  # and incorporate this into output
    return output  # then return the result


outputs1 = []
for arg in [1, 2, 3]:
    outputs1.append(process(arr1, arg))

outputs2 = []
for arg in [1, 2, 3]:
    outputs2.append(process(arr2, arg))

You then get:

>>> outputs1
[[1, 2, 3], [1, 4, 9], [1, 8, 27]]
>>> outputs2
[[1, 2, 3], [], []]
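
Given that the reader is a one-shot iterator, two workarounds suggest themselves: build a fresh iterator for every argument (which is what re-running pd.read_csv does), or walk the iterator once and evaluate every argument on each chunk. The sketch below illustrates both under stated assumptions: chunks_stand_in is a hypothetical generator standing in for the pandas reader, evaluate is a placeholder analysis, and process_fresh and process_single_pass are names invented here.

```python
def chunks_stand_in():
    """Hypothetical stand-in for pd.read_csv(..., chunksize=...): yields chunks once."""
    yield [1, 2]
    yield [3]


def evaluate(chunk, arg):
    """Placeholder analysis (an assumption): sum of the chunk values raised to arg."""
    return sum(x ** arg for x in chunk)


def process(file, arg):
    """Same shape as the question's process: one result per chunk."""
    return [evaluate(chunk, arg) for chunk in file]


def process_fresh(make_reader, args):
    """Option 1: create a new iterator per arg, re-reading the source each time."""
    return [process(make_reader(), arg) for arg in args]


def process_single_pass(reader, args):
    """Option 2: consume the iterator once and evaluate every arg on each chunk."""
    outputs = [[] for _ in args]
    for chunk in reader:
        for out, arg in zip(outputs, args):
            out.append(evaluate(chunk, arg))
    return outputs


fresh = process_fresh(chunks_stand_in, [1, 2, 3])
single = process_single_pass(chunks_stand_in(), [1, 2, 3])
```

Both calls produce [[3, 3], [5, 9], [9, 27]]. With pandas, the factory passed to process_fresh could be something like lambda: pd.read_csv("file.dat", delim_whitespace=True, chunksize=1000). For a file with hundreds of millions of rows, the single-pass variant is likely preferable, since it parses the file once instead of once per argument.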
