How do I append a Parquet table (pyarrow) to an existing Parquet file using the dask library?

Asked: 2020-09-06 12:17:43

Tags: pandas dataframe dask parquet pyarrow

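I am trying to implement a feature that stores data in a Parquet structure and writes it to a Parquet file. If the output Parquet file already exists, the new data should be appended to it using dask.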

    import dask.dataframe as dd
    import pandas as pd
    import pyarrow as pa

    # True until the first write; the first write creates the output,
    # every later write appends to it.
    self.flag = True
    self.columns = ['original_ids', 'masked_ids', 'masked_labels', 'masked_positions']

    # Every column holds a list of int32 values.
    fields = [
        pa.field('original_ids', pa.list_(pa.int32())),
        pa.field('masked_ids', pa.list_(pa.int32())),
        pa.field('masked_labels', pa.list_(pa.int32())),
        pa.field('masked_positions', pa.list_(pa.int32())),
    ]
    self.myschema = pa.schema(fields)

    # Wrap one record in a single-row pandas DataFrame, then convert it to dask.
    df_input = pd.DataFrame({'original_ids': [original_ids],
                             'masked_ids': [masked_ids],
                             'masked_labels': [masked_lm_ids],
                             'masked_positions': [masked_lm_positions]})
    df_input = dd.from_pandas(df_input, npartitions=1)

    if self.flag:
        # First call: create the Parquet output.
        dd.to_parquet(df_input, self.output_f, engine='pyarrow', compression='gzip',
                      write_index=False, compute=True, append=False,
                      ignore_divisions=True, schema=self.myschema)
        self.flag = False
    else:
        # Later calls: append to the existing output (this call raises the error).
        dd.to_parquet(df_input, self.output_f, engine='pyarrow', compression='gzip',
                      write_index=False, compute=True, append=True,
                      ignore_divisions=True, schema=self.myschema)

But I get the following error:

  File "/usr/local/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py", line 424, in to_parquet
    **kwargs_pass
  File "/usr/local/lib/python3.7/site-packages/dask/dataframe/io/parquet/arrow.py", line 583, in initialize_write
    "Previous: {} | New: {}".format(names, list(df.columns))
ValueError: Appended columns not the same.
Previous: ['item', 'item', 'item', 'item'] | New: ['original_ids', 'masked_ids', 'masked_labels', 'masked_positions']

How can I change the existing code to fix this error?
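
One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: every column in the schema is a list, and Parquet stores a list column as a nested path whose leaf is named `item`. If the append check reads leaf names from the existing file instead of top-level names, it would see `['item', 'item', 'item', 'item']`, which matches the "Previous" list in the traceback. The sketch below uses only the standard library and does not call dask or pyarrow. The dotted path strings and the helper names (`leaf_names`, `top_level_names`, `check_append`) are hypothetical and exist only to illustrate the comparison.

```python
# Hypothetical illustration: each column path is written as a dotted string,
# mirroring how a Parquet schema nests a list column (name -> list -> item).
previous_paths = [
    "original_ids.list.item",
    "masked_ids.list.item",
    "masked_labels.list.item",
    "masked_positions.list.item",
]
new_columns = ["original_ids", "masked_ids", "masked_labels", "masked_positions"]


def leaf_names(paths):
    """Return the last segment of each column path (the leaf name)."""
    return [p.split(".")[-1] for p in paths]


def top_level_names(paths):
    """Return the first segment of each column path, without duplicates."""
    seen = []
    for p in paths:
        root = p.split(".")[0]
        if root not in seen:
            seen.append(root)
    return seen


def check_append(previous, new):
    """Raise an error in the style of the traceback when the names differ."""
    if previous != new:
        raise ValueError(
            "Appended columns not the same.\n"
            "Previous: {} | New: {}".format(previous, new)
        )


# Comparing leaf names reproduces the mismatch shown in the traceback.
try:
    check_append(leaf_names(previous_paths), new_columns)
    leaf_check_passed = True
except ValueError as exc:
    leaf_check_passed = False
    leaf_error = str(exc)

# Comparing top-level names does not raise.
check_append(top_level_names(previous_paths), new_columns)
```

If this reading is correct, the mismatch comes from how the existing file's column names are read back, not from the `schema` passed to `to_parquet`. The source does not confirm this, so treat it as a starting point for debugging.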

0 answers