Bulk (6 million row) pandas df causes memory error with `to_sql` when chunksize=100, but can easily save a 100,000-row file with no chunksize

Date: 2019-05-29 22:59:14

Tags: python sql pandas

I created a large dataframe in pandas, around 6 million rows of text data. I wanted to save it as a SQL database file, but when I try to save it, I get an out-of-memory RAM error. I even reduced the chunksize to 100 and it still crashes.

However, if I take a smaller version of this dataframe (with 100,000 rows) and save it to a database without specifying a chunksize, I have no problem saving it.

Here is my code:

from sqlalchemy import create_engine
engine = create_engine("sqlite:///databasefile.db")
dataframe.to_sql("CS_table", engine, chunksize=100)

My understanding was that since only 100 rows are processed at a time, RAM usage should reflect what it takes to save 100 rows. Is something else happening behind the scenes? Perhaps multithreading?

Before I run this code I am using 4.8 GB of RAM, out of the 12.8 GB available in Google Colab. Running the code above eats up all the RAM until the environment crashes.

I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe has 2 columns and ~6 million rows. Each cell contains roughly this much text:


"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."

Edit:

I have done keyboard interrupts at various stages. Here is the result of a keyboard interrupt right after the first jump in RAM:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-22-51b6e444f80d> in <module>()
----> 1 dfAllT.to_sql("CS_table23", engine, chunksize = 100)

12 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
   2529         sql.to_sql(self, name, con, schema=schema, if_exists=if_exists,
   2530                    index=index, index_label=index_label, chunksize=chunksize,
-> 2531                    dtype=dtype, method=method)
   2532 
   2533     def to_pickle(self, path, compression='infer',

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
    458     pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
    459                       index_label=index_label, schema=schema,
--> 460                       chunksize=chunksize, dtype=dtype, method=method)
    461 
    462 

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method)
   1172                          schema=schema, dtype=dtype)
   1173         table.create()
-> 1174         table.insert(chunksize, method=method)
   1175         if (not name.isdigit() and not name.islower()):
   1176             # check for potentially case sensitivity issues (GH7815)

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in insert(self, chunksize, method)
    684 
    685                 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 686                 exec_insert(conn, keys, chunk_iter)
    687 
    688     def _query_iterator(self, result, chunksize, columns, coerce_float=True,

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in _execute_insert(self, conn, keys, data_iter)
    597         """
    598         data = [dict(zip(keys, row)) for row in data_iter]
--> 599         conn.execute(self.table.insert(), data)
    600 
    601     def _execute_insert_multi(self, conn, keys, data_iter):

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in execute(self, object_, *multiparams, **params)
    986             raise exc.ObjectNotExecutableError(object_)
    987         else:
--> 988             return meth(self, multiparams, params)
    989 
    990     def _execute_function(self, func, multiparams, params):

/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
    285     def _execute_on_connection(self, connection, multiparams, params):
    286         if self.supports_execution:
--> 287             return connection._execute_clauseelement(self, multiparams, params)
    288         else:
    289             raise exc.ObjectNotExecutableError(self)

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
   1105             distilled_params,
   1106             compiled_sql,
-> 1107             distilled_params,
   1108         )
   1109         if self._has_events or self.engine._has_events:

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1246         except BaseException as e:
   1247             self._handle_dbapi_exception(
-> 1248                 e, statement, parameters, cursor, context
   1249             )
   1250 

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1466                 util.raise_from_cause(sqlalchemy_exception, exc_info)
   1467             else:
-> 1468                 util.reraise(*exc_info)
   1469 
   1470         finally:

/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    127         if value.__traceback__ is not tb:
    128             raise value.with_traceback(tb)
--> 129         raise value
    130 
    131     def u(s):

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1222                 if not evt_handled:
   1223                     self.dialect.do_executemany(
-> 1224                         cursor, statement, parameters, context
   1225                     )
   1226             elif not parameters and context.no_parameters:

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py in do_executemany(self, cursor, statement, parameters, context)
    545 
    546     def do_executemany(self, cursor, statement, parameters, context=None):
--> 547         cursor.executemany(statement, parameters)
    548 
    549     def do_execute(self, cursor, statement, parameters, context=None):

KeyboardInterrupt: 

Here is the result if I do a keyboard interrupt right before it crashes:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-68b60fe221fe>", line 1, in <module>
    dfAllT.to_sql("CS_table22", engine, chunksize = 100)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
    dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
    chunksize=chunksize, dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
    table.insert(chunksize, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
    exec_insert(conn, keys, chunk_iter)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 599, in _execute_insert
    conn.execute(self.table.insert(), data)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1468, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py", line 547, in do_executemany
    cursor.executemany(statement, parameters)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
    frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
  File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/usr/lib/python3.6/inspect.py", line 739, in getmodule
    f = getabsfile(module)
  File "/usr/lib/python3.6/inspect.py", line 708, in getabsfile
    _filename = getsourcefile(object) or getfile(object)
  File "/usr/lib/python3.6/inspect.py", line 693, in getsourcefile
    if os.path.exists(filename):
  File "/usr/lib/python3.6/genericpath.py", line 19, in exists
    os.stat(path)
KeyboardInterrupt

I did another run right before the crash, and this one seems to give yet another different result:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-f18004debe33>", line 1, in <module>
    dfAllT.to_sql("CS_table25", engine, chunksize = 100)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
    dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
    chunksize=chunksize, dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
    table.insert(chunksize, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
    exec_insert(conn, keys, chunk_iter)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in _execute_insert
    data = [dict(zip(keys, row)) for row in data_iter]
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in <listcomp>
    data = [dict(zip(keys, row)) for row in data_iter]
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
    frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
  File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/usr/lib/python3.6/inspect.py", line 742, in getmodule
    os.path.realpath(f)] = module.__name__
  File "/usr/lib/python3.6/posixpath.py", line 388, in realpath
    path, ok = _joinrealpath(filename[:0], filename, {})
  File "/usr/lib/python3.6/posixpath.py", line 421, in _joinrealpath
    newpath = join(path, name)
KeyboardInterrupt
---------------------------------------------------------------------------

Other things I have tried:

Using dropna to remove all None/NaN values

dfAllT = dfAllT.applymap(str) to make sure all my values are strings

dfAllT.reset_index(drop=True, inplace=True) to make sure the index is not misaligned.

Edit:

As mentioned in the comments, I have now tried calling to_sql in a loop.

for i in range(586147):
    print(i)
    dfAllT.iloc[i*10000:(i+1)*10000].to_sql('CS_table', engine, if_exists= 'append')

This ends up eating my RAM and eventually crashing about halfway through. I am wondering if this means sqlite is holding everything in memory, and whether there is a workaround.
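As a point of comparison (this is a minimal sketch, not one of my actual runs), a chunked insert through the stdlib `sqlite3` module, bypassing pandas' to_sql and SQLAlchemy entirely, might look like this; the toy frame and table names stand in for the real ones:

```python
import itertools
import sqlite3
import pandas as pd

# Toy stand-in for the real 6-million-row frame: two text columns.
df = pd.DataFrame({"title": [f"t{i}" for i in range(1000)],
                   "abstract": [f"a{i}" for i in range(1000)]})

conn = sqlite3.connect(":memory:")  # use e.g. "databasefile.db" for an on-disk file
conn.execute("CREATE TABLE IF NOT EXISTS CS_table (title TEXT, abstract TEXT)")

chunksize = 100
rows = df.itertuples(index=False, name=None)  # plain tuples, no per-row dicts
while True:
    chunk = list(itertools.islice(rows, chunksize))
    if not chunk:
        break
    conn.executemany("INSERT INTO CS_table VALUES (?, ?)", chunk)
    conn.commit()  # commit per chunk so one giant transaction never builds up

n = conn.execute("SELECT COUNT(*) FROM CS_table").fetchone()[0]
print(n)  # → 1000
```

Committing per chunk keeps sqlite's pending transaction small, which is the thing a single long-running to_sql call cannot do.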

Edit:

I tried a few more things: shorter chunks, and disposing of the engine after every step and creating a new one. It still ended up eating all the RAM and crashing.

for i in range(586147):
    print(i)
    engine = sqlalchemy.create_engine("sqlite:///CSTitlesSummariesData.db")
    dfAllT.iloc[i*10:(i+1)*10].to_sql('CS_table', engine, index=False, if_exists='append')
    engine.dispose()
    gc.collect()

My thoughts:

So it looks like the entire database is somehow being kept in active memory.

The pandas dataframe this came from is 5 GB, or at least that is how much RAM it takes before I try converting it to sqlite. My system dies at around 12.72 GB. I would imagine the sqlite database takes up less memory than the pandas dataframe.
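As an aside, the frame's actual in-memory footprint can be measured with `DataFrame.memory_usage(deep=True)`; a small sketch on toy data shaped like mine (two text columns):

```python
import pandas as pd

# Toy data mimicking the title/abstract layout of the real frame.
df = pd.DataFrame({"title": ["a" * 100] * 10_000,
                   "abstract": ["b" * 1_000] * 10_000})

# deep=True counts the actual Python string payloads, not just pointers.
total_bytes = int(df.memory_usage(deep=True).sum())
print(f"{total_bytes / 1e6:.1f} MB")
```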

2 Answers:

Answer 0 (score: 3)

I had been using df.to_sql for a year, and now I was struggling with the fact that I was running it against a large amount of data and it wasn't working. I realized that chunksize overloads your memory: pandas loads everything into memory and only then sends it out in chunks. I had to take control directly with SQL. (Here is where I found the solution -> https://github.com/pandas-dev/pandas/issues/12265 I really encourage you to read it to the end.)

If you need to read data from a database without overloading memory, check out this code:

import math
import pandas as pd

def get_data_by_chunks(chunksize: int):
    # MysqlClient is the answerer's own engine wrapper
    with MysqlClient.get_engine().begin() as conn:
        row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]

        for i in range(math.ceil(row_count / chunksize)):
            query = f"""
                SELECT * FROM my_table
                -- add WHERE filters here as needed
                LIMIT {chunksize} OFFSET {i * chunksize};
            """
            yield pd.read_sql(query, conn)

for df in get_data_by_chunks(chunksize=10000):
    print(df.shape)
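As an aside, `pd.read_sql` can also stream results natively: passing `chunksize` makes it return an iterator of DataFrames rather than one large frame. A minimal sketch against a throwaway in-memory SQLite table (table and column names are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (x INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?)", [(i,) for i in range(10)])

# With chunksize set, read_sql yields DataFrames of at most that many rows.
shapes = [chunk.shape
          for chunk in pd.read_sql("SELECT * FROM my_table", conn, chunksize=4)]
print(shapes)
conn.close()
```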

Answer 1 (score: 1)

From stepping through the code, I think it's this line, which creates a bunch of DataFrame chunks as it reads the data:

chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])

That seems like it is probably the bug. Specifically, this happens prior to the database insertion, in preparation.

One trick you can do is hit CTRL-C while the memory is rapidly increasing and see which line it stops on (my bet is that it's this one).
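A less disruptive variant of this trick is the stdlib `tracemalloc` module, which reports which source lines hold the most allocated memory. A toy sketch; the row-dict workload below merely imitates pandas' per-row `dict(zip(keys, row))` preparation:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the insert preparation: build many row dicts,
# analogous to `data = [dict(zip(keys, row)) for row in data_iter]`.
keys = ("title", "abstract")
data = [dict(zip(keys, (str(i), str(i)))) for i in range(50_000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")
for stat in top[:3]:
    print(stat)  # shows file:line, total size, allocation count
```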