Question

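Hello, I am trying to create a pandas DataFrame (from a list of dicts or a dict of dicts) whose final shape is 60,000 rows by 10,000 columns.

The column values are 0 or 1 and the data is very sparse.

Creating the list/dict object is fast, but I get a memory error when I call from_dict or from_records. I also tried appending to the DataFrame in batches instead of all at once, but that still did not work. I also tried setting every cell individually, to no avail.

By the way, I am building my Python objects from 100 JSON files that I parse.

How do I get from the Python objects to a DataFrame? Maybe I could use something else instead. Ultimately I want to feed the data to an sk-learn algorithm.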

Answer 1

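If your only values are 0 and 1, you should use a boolean dtype (bool / np.bool_) or np.int8. Each cell then takes 1 byte, which reduces memory consumption by at least a factor of 4.

Here is a small demo (assuming numpy is imported as np and pandas as pd):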

In [34]: df = pd.DataFrame(np.random.randint(0, 2, (60000, 10000)))  # upper bound is exclusive, so 2 yields 0s and 1s

In [35]: df.shape
Out[35]: (60000, 10000)

In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 2.2 GB

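In this session the integers are stored as np.int32 (32 bits, or 4 bytes each). On platforms where the default integer is np.int64, the same frame would take twice as much memory.

Let's downcast it to np.int8: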

In [39]: df_int8 = df.astype(np.int8)

In [40]: df_int8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int8(10000)
memory usage: 572.2 MB

It now consumes 572 MB instead of 2.2 GB (4 times less).
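The figures reported by df.info() follow from the array shape and the item size of the dtype. Here is a minimal sketch of that arithmetic, assuming a dense frame and ignoring index overhead. The helper name dense_size_bytes is made up for illustration.

```python
import numpy as np

ROWS, COLS = 60_000, 10_000


def dense_size_bytes(dtype):
    """Bytes needed for a dense ROWS x COLS array of the given dtype."""
    return ROWS * COLS * np.dtype(dtype).itemsize


for name in ("int64", "int32", "int8", "bool"):
    size = dense_size_bytes(name)
    # pandas reports memory in binary units, so divide by 1024**2 for "MB"
    print(f"{name:>5}: {size / 1024**2:8.1f} MB")
```

For int8 this gives 600,000,000 bytes, which is about 572.2 MB in binary units, and int32 is exactly four times that.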

Or use a boolean dtype:

In [41]: df_bool = df.astype(bool)

In [42]: df_bool.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: bool(10000)
memory usage: 572.2 MB
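The demo above converts a frame that already exists, but the question runs out of memory while the frame is being built. One possible way to apply the same idea earlier is to allocate an int8 array up front and fill it from the parsed dicts. This avoids the float64 columns that pandas would create when keys are missing from some dicts (missing values become NaN). This is only a sketch under assumptions: the records list and its keys are hypothetical stand-ins for the parsed JSON, and each dict is assumed to hold only 0/1 values.

```python
import numpy as np
import pandas as pd

# Hypothetical parsed records: each dict lists only the columns that are set.
records = [
    {"a": 1, "c": 1},
    {"b": 1},
    {},
    {"a": 1, "d": 1},
]

# Fix the column order once so every row maps to the same positions.
columns = sorted({key for rec in records for key in rec})
col_index = {name: i for i, name in enumerate(columns)}

# Allocate the final int8 buffer (1 byte per cell) and fill it in place.
data = np.zeros((len(records), len(columns)), dtype=np.int8)
for row, rec in enumerate(records):
    for key, value in rec.items():
        data[row, col_index[key]] = value

# Wrap the array in a DataFrame; the dtype stays int8 for every column.
df = pd.DataFrame(data, columns=columns)
print(df.dtypes.unique())
print(df)
```

If the DataFrame is only a stepping stone, the data array (or df.to_numpy()) can be passed to an sk-learn estimator directly.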

Answer 2

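Another thing you can try is enabling pyarrow (here spark is an existing SparkSession):

spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Spark 3.x prefers the key spark.sql.execution.arrow.pyspark.enabled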

This sped up my call to pd.DataFrame by an order of magnitude!

(Note: with a newer pyarrow, e.g. pyarrow>=1.0.0, you must use pyspark>=3.0.0. For pyspark==2.x, using {{1} is the easiest.)
