How do you append to a Parquet file using pyarrow?

Time: 2017-11-04 17:59:40

Tags: python pandas parquet pyarrow

How do you append to or update a parquet file with pyarrow?

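I could not find anything in the documentation about appending to Parquet files. Also, can pyarrow be used with multiprocessing to insert/update the data?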

3 answers:

Answer 0: (score: 8)

I ran into the same problem, and I think I was able to solve it with the following approach:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


chunksize = 10000  # number of CSV rows read per chunk

pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)
        pqwriter.write_table(table)
    # subsequent chunks can be written to the same file
    else:
        pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()

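Answer 1: (score: 5)

Generally speaking, a Parquet dataset consists of multiple files, so you append by writing an additional file into the same directory where the data belongs. It would be useful to be able to concatenate multiple files easily. I opened https://issues.apache.org/jira/browse/PARQUET-1154 to make this easy to do in C++ (and therefore also in Python).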

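Answer 2: (score: 5)

In your case the column names are not consistent. I made the column names consistent across the three sample dataframes, and the following code works for me.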

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Method writes/append dataframes in parquet format.

    This method is used to write pandas DataFrame as pyarrow Table in parquet format. If the method is invoked
    with writer, it appends dataframe to the already written pyarrow table.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed in the subsequent method calls to append DataFrame
        in the pyarrow Table
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer


if __name__ == '__main__':

    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)

Output:

   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz