Question

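I have a Pyramid view that loads data from a large file into a database. For each line in the file it does some processing, then creates some model instances and adds them to the session. This works fine unless the file is large. For large files, the view slowly eats up all my memory until everything effectively grinds to a halt.

So my idea is to process each line individually with a function that creates a session, creates the necessary model instances, adds them to the current session, and then commits.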

def commit_line(lTitles,lLine,oStartDate,oEndDate,iDS,dSettings):
    from sqlalchemy.orm import (
            scoped_session,
            sessionmaker,
    )
    from sqlalchemy import engine_from_config
    from pyramidapp.models import Base, DataEntry
    from zope.sqlalchemy import ZopeTransactionExtension
    import transaction

    oCurrentDBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
    engine = engine_from_config(dSettings, 'sqlalchemy.')
    oCurrentDBSession.configure(bind=engine)
    Base.metadata.bind = engine

    oEntry = DataEntry()
    oCurrentDBSession.add(oEntry)
    ...
    transaction.commit()

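My requirements for this function are as follows:

1. create a session (check)
2. make a bunch of model instances (check)
3. add those instances to the session (check)
4. commit those models to the database
5. get rid of the session (so that it and the objects created in 2 are garbage collected)

I have made sure the newly created session is passed as an argument wherever necessary, to prevent errors about multiple sessions and so on. But alas! I can't make the database connection go away, and nothing gets committed.

I tried separating the function out into a Celery task so that the view runs to completion and does what it needs to, but I get an error in Celery about having too many MySQL connections, no matter what I try in terms of committing, closing and disposing, and I don't know why. And yes, I restart the Celery server when I make changes.

Surely there is a simple way to do this? All I want to do is make a session commit, then go away and leave me alone.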

Answer 1

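Creating a new session for each line of your big file would be quite slow, I imagine.

What I would try is to commit the session and expunge all objects from it every 1000 lines or so: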

counter = 0

for line in mymegafile:
    entry = process_line(line)
    session.add(entry)
    counter += 1
    if counter >= 1000:
        counter = 0
        transaction.commit()  # if you insist on using ZopeTransactionExtension, otherwise session.commit()
        session.expunge_all() # this may not be required actually, see https://groups.google.com/forum/#!topic/sqlalchemy/We4XGX2CYX8

transaction.commit()  # commit the last, partially filled batch

If nothing references the DataEntry instances from anywhere, the Python interpreter should garbage collect them at some point.
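The point about references can be illustrated with a small standard-library sketch. It does not model SQLAlchemy's session internals: a plain list stands in for "anything that still holds a reference", and the `DataEntry` class here is a hypothetical stand-in for the real model.

```python
import gc
import weakref


class DataEntry:
    """Hypothetical stand-in for the ORM model; a plain class is enough here."""


# A plain list plays the role of "anything that still references the instance".
holder = []

entry = DataEntry()
probe = weakref.ref(entry)  # lets us observe the object without keeping it alive

holder.append(entry)
del entry
gc.collect()
still_alive = probe() is not None  # the list still holds a strong reference

holder.clear()
gc.collect()
collected = probe() is None  # the last strong reference is gone
```

As long as some container keeps a strong reference, the instance stays in memory; once the last reference is dropped, it can be reclaimed.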

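However, if all you are doing in that view is inserting new records into the database, it may be much more efficient to bulk-insert the data using SQLAlchemy Core constructs or literal SQL. This would also get rid of the problem of ORM instances eating up your RAM. See "I'm inserting 400,000 rows with the ORM and it's really slow!" in the SQLAlchemy FAQ for details.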
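As a rough sketch of the literal-SQL route, the following uses the standard-library `sqlite3` module instead of MySQL or SQLAlchemy Core, purely so that it runs on its own. The `data_entry` table, its columns, and the `batches` and `bulk_insert` helpers are assumptions made for illustration, not part of the original application.

```python
import sqlite3
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items without materialising the whole input."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def bulk_insert(connection, rows, batch_size=1000):
    """Insert (ds_id, value) tuples in batches, committing after each batch."""
    total = 0
    for chunk in batches(rows, batch_size):
        connection.executemany(
            "INSERT INTO data_entry (ds_id, value) VALUES (?, ?)", chunk
        )
        connection.commit()
        total += len(chunk)
    return total


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE data_entry (ds_id INTEGER, value TEXT)")

# A generator stands in for the lines of the big file.
rows = ((i % 7, "value-%d" % i) for i in range(2500))
inserted = bulk_insert(connection, rows, batch_size=1000)
row_count = connection.execute("SELECT COUNT(*) FROM data_entry").fetchone()[0]
```

Only one batch of plain tuples is held in memory at a time, and no ORM instances are created at all.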

Answer 2

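So I tried a bunch of things, and although it is probably possible to solve this with SQLAlchemy's built-in functionality, I could not find a way to do it.

So here is an outline of what I did:

1. separate the lines to be processed into batches
2. for each batch of lines, queue up a Celery task to deal with those lines
3. in the Celery task, launch a separate process that does the necessary work with the lines

The reasoning:

1. The batching is obvious
2. Celery is used because processing a whole file takes a very long time, so queueing makes sense
3. The task launches a separate process because otherwise I ran into the same problem I had with the Pyramid application

Some code:

The Celery task: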

def commit_lines(lLineData,dSettings,cwd):
    """
    writes the line data to a file then calls a process that reads the file and creates
    the necessary data entries. Then deletes the file
    """
    import lockfile
    sFileName = "/home/sheena/tmp/cid_line_buffer"
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName,'a') #in case the process was at any point interrupted...
        for d in lLineData:
            f.write('{0}\n'.format(d))
        f.close()

    #now call the external process
    import subprocess
    import os
    sConnectionString = dSettings.get('sqlalchemy.url')
    lArgs = [
                'python',os.path.join(cwd,'commit_line_file.py'),
                '-c',sConnectionString,
                '-f',sFileName
        ]
    #open the subprocess. wait for it to complete before continuing with stuff. if errors: raise
    subprocess.check_call(lArgs,shell=False)
    #and clear the file
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName,'w')
        f.close()

The external process:

"""
this script goes through all lines in a file and creates data entries from the lines
"""
import ast
from optparse import OptionParser

import lockfile
import transaction
from sqlalchemy import create_engine

#imported at module level so that create_entry can see DBSession and DataEntry
from pyramidapp.models import Base, DBSession, DataEntry

def main():
    #get options

    oParser = OptionParser()
    oParser.add_option('-c','--connection_string',dest='connection_string')
    oParser.add_option('-f','--input_file',dest='input_file')
    (oOptions, lArgs) = oParser.parse_args()

    #set up connection

    engine = create_engine(
        oOptions.connection_string,
        echo=False)
    DBSession.configure(bind=engine)
    Base.metadata.bind = engine

    #commit stuffs
    lock = lockfile.FileLock("{0}_lock".format(oOptions.input_file))
    with lock:
        with open(oOptions.input_file,'r') as f:
            for sLine in f:
                dLine = ast.literal_eval(sLine)
                create_entry(**dLine)

    transaction.commit()

def create_entry(iDS,oStartDate,oEndDate,lTitles,lValues):
    oEntry = DataEntry()
    #do some other stuff, make more model instances...
    DBSession.add(oEntry)


if __name__ == "__main__":
    main()

In the view:

lLineData = []
for line in big_giant_csv_file_handler:
    lLineData.append({'stuff':'lots'})

if lLineData:
    #split the collected lines into batches of iBatchSize
    lLineSets = [lLineData[i:i+iBatchSize] for i in range(0,len(lLineData),iBatchSize)]
    for l in lLineSets:
        commit_lines.delay(l,dSettings,sCWD)  #queue it for celery

Answer 3

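You are just doing it wrong. Period.

Quoting the SQLAlchemy docs:

The advanced developer will try to keep the details of session, transaction and exception management as far as possible from the details of the program doing its work.

Quoting the Pyramid docs:

We made the decision to use SQLAlchemy to talk to our database. We also, though, installed pyramid_tm and zope.sqlalchemy.

Why?

Pyramid has a strong orientation towards support for transactions. Specifically, you can install a transaction manager into your application, either as middleware or a Pyramid "tween". Then, just before you return the response, all transaction-aware parts of your application are executed. This means Pyramid view code usually doesn't manage transactions.

My answer today is not code, but a recommendation to follow the best practices recommended by the authors of the packages and frameworks you are working with.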


Answer 4

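Encapsulate the CSV reading and the creation of SQLAlchemy model instances in something that supports the iterator protocol. I called it BatchingModelReader. It returns a collection of DataEntry instances, and the collection size depends on the batch size. If the model changes over time, you do not need to change the Celery task. The task only puts a batch of models into a session and commits the transaction. By controlling the batch size you control memory consumption. Neither BatchingModelReader nor the Celery task holds large amounts of intermediate data. This example also shows that using Celery is only an option. I added a link to a code sample of a Pyramid application that I am actually refactoring in a GitHub fork.

BatchingModelReader - encapsulates csv.reader and uses the existing models from your Pyramid application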

Inspired by the source code of csv.DictReader
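The answer does not show BatchingModelReader itself, so here is a minimal sketch of what such a reader could look like. The constructor signature is an assumption: it takes an open file object and a hypothetical `model_factory` callable (for example `lambda row: DataEntry(**row)`) rather than a path, so that it can be run without the Pyramid application.

```python
import csv
import io


class BatchingModelReader:
    """Iterate over a CSV source and yield lists of model instances.

    Hypothetical sketch: `model_factory` is any callable that turns a row
    dict into a model instance.
    """

    def __init__(self, csv_file, batchsize, model_factory):
        self._reader = csv.DictReader(csv_file)
        self._batchsize = batchsize
        self._model_factory = model_factory

    def __iter__(self):
        batch = []
        for row in self._reader:
            batch.append(self._model_factory(row))
            if len(batch) >= self._batchsize:
                yield batch
                batch = []  # drop our reference so the old batch can be freed
        if batch:
            yield batch  # final, possibly shorter, batch


# Usage with an in-memory CSV; `dict` stands in for a real model class.
sample = io.StringIO("ds,value\n1,a\n1,b\n2,c\n")
batch_sizes = [len(batch) for batch in BatchingModelReader(sample, 2, dict)]
```

Because the reader only ever holds one batch, memory use is bounded by the batch size rather than by the size of the file.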

Could be run as a Celery task - use the appropriate task decorator

from .models import DBSession
import transaction

def import_from_csv(path_to_csv, batchsize):
    """given a CSV file and batchsize iterate over batches of model instances and import them to database"""
    for batch in BatchingModelReader(path_to_csv, batchsize):
        with transaction.manager:
            DBSession.add_all(batch)

Pyramid view - just save the big giant CSV file, start the task, and return a response immediately

@view_config(...)
def view(request):
    """gets file from request, save it to filesystem and start celery task"""
    with open(path_to_csv, 'w') as f:
        f.write(big_giant_csv_file)

    #start task with parameters
    import_from_csv.delay(path_to_csv, 1000)

Code sample

Pyramid with SQLAlchemy

Databases using SQLAlchemy

SQLAlchemy internals

How to commit model instances and remove them from working memory one piece at a time
