Importing strings containing line terminators from a CSV into a dask dataframe

Asked: 2017-08-05 07:40:23

Tags: python csv import dask

I have a CSV file whose string fields contain line terminators. I can import it into pandas with this code:

df_desc = pd.read_csv(import_desc, sep="|")

But when I try to import it into a dask dataframe:

import dask.dataframe as ddf
import_desc = "data/info.csv"
df_desc = ddf.read_csv(import_desc, sep="|", blocksize=None, dtype='str')

I get this error:

Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1578, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1015, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/data_extraction_dask.py", line 10, in <module>
    df_desc = ddf.read_table(import_desc, sep="|", blocksize=None, dtype='str')
  File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 323, in read
    **kwargs)
  File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 243, in read_pandas
    head = reader(BytesIO(b_sample), **kwargs)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 982, in read
    ret = self._engine.read(nrows)
  File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1719, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
  File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
  File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
  File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
  File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 130

The documentation says:

  

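It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.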

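That is why I pass blocksize=None. However, the function also uses a sampling strategy: it reads the first bytes of the file to determine the column types, and I think that is what produces this error.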
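To illustrate the suspected failure mode, here is a minimal sketch that uses pandas alone. The data is made up for illustration and the cut point is chosen by hand. It only mimics what happens when a byte sample ends inside a quoted field; it is not dask's actual sampling code.

```python
import io

import pandas as pd
from pandas.errors import ParserError

# Hypothetical data: the first record has a quoted field containing a newline.
raw = b'id|text\n1|"first line\nsecond line"\n2|"plain"\n'

# The complete buffer parses fine: the embedded newline stays inside the field.
full = pd.read_csv(io.BytesIO(raw), sep="|", dtype=str)

# Cut the buffer in the middle of the quoted field, as a byte-based sample might.
cut = raw.index(b"second")
sample = raw[:cut]


def sample_parses(buf):
    """Return True if pandas can parse the buffer, False on a tokenizing error."""
    try:
        pd.read_csv(io.BytesIO(buf), sep="|", dtype=str)
        return True
    except ParserError:
        return False


print(sample_parses(raw))     # complete file
print(sample_parses(sample))  # sample truncated inside the quoted string
```

If the sample taken from the real file happens to end inside a quoted field, the same "EOF inside string" error would be expected regardless of the dtype argument.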

Even when I specify the types with dtype, I cannot skip the sampling step.

Is there a workaround?

0 answers:

No answers yet