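I am reading a list of files with dask's read_parquet, concatenating the resulting dataframes, and writing the result to a file. During the concatenation, does dask first read all the data into memory, or does it only concatenate the schemas? (I am concatenating along axis 0.)
Thanks in advance.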
Answer 0 (score: 3)
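"Dask DataFrames are lazy by default" (see the documentation), so unless you trigger a compute, the concatenation works only on the schema (the column names and dtypes), not on the data itself.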
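import pandas as pd
import dask.dataframe as dd
import numpy as np
# Two small pandas frames with different column counts (2 and 3)
df1 = pd.DataFrame(np.random.randn(10, 2))
df2 = pd.DataFrame(np.random.randn(10, 3))
# Wrap each one as a Dask DataFrame with 2 partitions
ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
# Concatenate along axis 0 (the default); nothing is computed here
ddf = dd.concat([ddf1, ddf2])
# Printing shows only the structure (columns, dtypes, partitions)
print(ddf)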
Dask DataFrame Structure:
                     0        1        2
npartitions=4
               float64  float64  float64
                   ...      ...      ...
                   ...      ...      ...
                   ...      ...      ...
                   ...      ...      ...
Dask Name: concat, 8 tasks
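To make the difference between the lazy structure and the materialized data concrete, here is a minimal sketch that reuses the same setup. The variable names (schema_columns, result, n_rows) are only for illustration and are not part of dask. It inspects the metadata first and only then calls compute().

import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame(np.random.randn(10, 2))
df2 = pd.DataFrame(np.random.randn(10, 3))
ddf = dd.concat([
    dd.from_pandas(df1, npartitions=2),
    dd.from_pandas(df2, npartitions=2),
])

# Still lazy: ddf is a Dask DataFrame, and its column names come from
# metadata, without running the concatenation.
is_lazy = isinstance(ddf, dd.DataFrame)
schema_columns = list(ddf.columns)

# compute() runs the task graph and returns an ordinary pandas DataFrame.
result = ddf.compute()
n_rows = len(result)
print(schema_columns, n_rows)

As far as I know the same applies to frames created with dd.read_parquet: row data should only be read when you call compute() or a writer such as to_parquet, and a writer processes the data partition by partition rather than loading everything at once.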