Question

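I have data in Parquet format that is too large to fit in memory (6 GB). I'm looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, downsample it, and save the result to a dataframe? Ultimately, I want the data in dataframe format.

Am I wrong to attempt this without using the Spark framework?

I tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file. Any tips or suggestions would be greatly appreciated!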

Answer 1

Spark is certainly a viable choice for this task.

We plan to add streaming read logic to pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I recommend reading one row group at a time to mitigate the memory usage problems. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.

Answer 2

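This is not an answer; I'm posting here because this is the only relevant post I could find on Stack Overflow. I'm trying to use the read_row_group function, but Python just exits with code 139. There is no other error message, and I'm not sure how to fix it.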

from pyarrow.parquet import ParquetFile
path = "sample.parquet"
f = ParquetFile(source = path)
print(f.num_row_groups) # it will print number of groups

# if I read the entire file:
df = f.read() # this works

# try to read row group
row_df = f.read_row_group(0)

# I get:
# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Python version 3.6.3

pyarrow version 0.11.1

Streaming parquet file python and only downsampling
