Question

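As I understand it, HDFStore.select is the tool to use for selecting from large data sets. However, when I try to loop over chunks using chunksize and iterator=True, the iterator itself becomes a very large object once the underlying dataset is large enough. I don't understand why the iterator object is so large, or what information it holds that makes it grow that much.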

I have a very large HDFStore structure (7 billion rows, 420 GB on disk) that I want to iterate over in chunks:

iterator = HDFStore.select('df', iterator=True, chunksize=chunksize)

for i, chunk in enumerate(iterator):
    # some code to apply to each chunk

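When I run this code on a relatively small file, everything works fine. But when I try to apply it to the 7bn-row database, I get a Memory Error while the iterator is being computed. I have 32 GB of RAM.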

I would like a generator that creates the chunks on the fly, without storing so much in RAM, for example:

iteratorGenerator = lambda: HDFStore.select('df', iterator=True, chunksize=chunksize)

for i, chunk in enumerate(iteratorGenerator):
    # some code to apply to each chunk

But iteratorGenerator is not iterable, so this doesn't work either.
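To illustrate why the lambda version fails: enumerate receives the function object itself, not the iterator it would return. The minimal sketch below uses a plain list iterator as a stand-in for the select call (the name make_iterator is hypothetical), so it runs without any HDF5 file.

```python
# Stand-in for: lambda: store.select('df', iterator=True, chunksize=chunksize)
make_iterator = lambda: iter([10, 20, 30])

try:
    # Passing the function itself: a function object is not iterable.
    enumerate(make_iterator)
    enumerate_failed = False
except TypeError:
    enumerate_failed = True

# Calling the function first yields an iterator that enumerate accepts.
pairs = list(enumerate(make_iterator()))
```

Calling the lambda only defers when select is invoked; it does not change how much memory the returned iterator needs.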

I could probably loop HDFStore.select over start and stop rows, but I think there should be a more elegant way to iterate.
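For reference, the start/stop loop mentioned above could look roughly like the sketch below. The helper names iter_row_ranges and iter_chunks are hypothetical, and the sketch assumes the data was written in table format so that get_storer(key).nrows reports the row count. It is not a claim about how pandas builds its own chunk iterator.

```python
def iter_row_ranges(nrows, chunksize):
    """Yield (start, stop) pairs that cover rows 0..nrows in order."""
    for start in range(0, nrows, chunksize):
        yield start, min(start + chunksize, nrows)


def iter_chunks(store, key, chunksize):
    """Lazily yield one chunk per row range from an open pandas HDFStore.

    Assumption: `key` was stored in table format, so nrows is available.
    """
    nrows = store.get_storer(key).nrows
    for start, stop in iter_row_ranges(nrows, chunksize):
        # Each call asks the store only for rows [start, stop).
        yield store.select(key, start=start, stop=stop)


# Possible usage (file_path and chunksize as in the question):
# import pandas as pd
# with pd.HDFStore(file_path, mode='r') as store:
#     for i, chunk in enumerate(iter_chunks(store, 'df', chunksize)):
#         ...  # some code to apply to each chunk
```

Because iter_chunks is a generator, no chunk is read until the loop asks for it, and only the current chunk needs to be alive at any time.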

Answer 1

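I had the same problem with a file of (only) 30 GB, and apparently you can solve it by forcing the garbage collector to do its job... collect! :P PS: You also don't need the lambda. The select call returns an iterator, so just loop over it as you did in your first code block.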

import gc
import pandas as pd

with pd.HDFStore(file_path, mode='a') as store:
    # All you need is the chunksize,
    # not the iterator=True
    iterator = store.select('df', chunksize=chunksize)

    for i, chunk in enumerate(iterator):

        # some code to apply to each chunk

        # magic line that solved my memory problem:
        # force a garbage collection pass after each chunk
        gc.collect()

Iterator and chunksize in HDFStore.select: "Memory Error"
