Appending dataframes with Pandas to read a large csv file

Time: 2019-01-11 15:29:09

Tags: python pandas csv dataframe

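I am trying to read a ~2GB csv file with pandas, and I set up a for loop to split the file into smaller chunks. However, I still get a MemoryError when I try this.

My initial idea was to append each chunk to a list and then combine the list into a single dataframe at the end. However, when I run the loop I get a MemoryError, which I thought chunking would avoid. Is it not possible to append inside the loop?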

# import pandas library
import pandas as pd

# path to the gct file to parse into a dataframe
datafile = "test_data_1.gct"

# create list to store chunks
df_list = []

# read in dataframe in chunks
for chunk in pd.read_csv(datafile, skiprows=2, sep='\t', chunksize=500):
    df_list.append(chunk)
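
The loop above only collects the chunks; the combining step described in the text is not shown. A minimal sketch of the whole idea follows, assuming the combination is done with pd.concat. The in-memory sample data and the helper name read_in_chunks are hypothetical stand-ins for illustration, not part of the original script.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real .gct file: two leading lines that
# are skipped, then a tab-separated table with a header row.
sample = "#1.2\n3\t2\nName\tA\tB\nr1\t1\t2\nr2\t3\t4\nr3\t5\t6\n"


def read_in_chunks(source, chunksize):
    """Read a tab-separated source chunk by chunk, then combine the pieces."""
    chunks = []
    for chunk in pd.read_csv(source, skiprows=2, sep="\t", chunksize=chunksize):
        chunks.append(chunk)
    # ignore_index=True renumbers the rows 0..n-1 across all chunks
    return pd.concat(chunks, ignore_index=True)


df = read_in_chunks(io.StringIO(sample), chunksize=2)
print(df.shape)
```

Note that keeping every chunk in a list and concatenating them still requires the whole table to fit in memory, so chunking by itself does not lower the peak memory needed for the final dataframe.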

When I run the read loop from my script, I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1051, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 348, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 459, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\frame.py", line 7364, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4872, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4918, in form_blocks
    int_blocks = _multi_blockify(items_dict['IntBlock'])
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 4995, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "C:\Users\samiv\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\internals.py", line 5037, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
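
The paths in the traceback contain Python37-32, which is the default install directory name for a 32-bit Python 3.7 on Windows, so the interpreter may be a 32-bit build. A 32-bit process has only a few gigabytes of address space, which could produce a MemoryError on a ~2GB file regardless of chunking. This is an inference from the path, not something confirmed in the question. A minimal sketch for checking the interpreter, where the helper name interpreter_bits is hypothetical:

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *)
    return struct.calcsize("P") * 8


bits = interpreter_bits()
# sys.maxsize exceeds 2**32 only on a 64-bit build
is_64bit = sys.maxsize > 2**32
print(bits, is_64bit)
```

If this reports 32, trying the same script under a 64-bit Python would be a reasonable next step.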

0 answers:

No answers