
Time: 2017-04-21 17:59:01

Tags: python postgresql pandas sqlalchemy

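I have hit a performance bottleneck while trying to update multiple records in two tables at once. Currently, I have two Pandas DataFrames (new_records_df and modified_records_df) containing the records I want to insert/update. See the following pseudocode: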

if not new_records_df.empty:
    new_recs_data = new_records_df.T.to_dict().values()  # creates a list of dictionaries from the DataFrame
    new_recs = []
    for r in new_recs_data:
        # foo_id and bar are pseudocode placeholders for values taken from r
        new_rec = {'foo_id': foo_id,
                   'bar': bar}
        new_recs.append(new_rec)  # collect the mapping so it actually gets inserted
    db_session.bulk_insert_mappings(Record, new_recs, return_defaults=True)  # return_defaults writes the id of each inserted record back into its dictionary
    new_related_recs = []
    for nr in new_recs:
        # baz is a pseudocode placeholder as well
        new_related_rec = {'rec_id': nr['id'],
                           'baz': baz}
        new_related_recs.append(new_related_rec)
    db_session.bulk_insert_mappings(RelatedRec, new_related_recs)

if not modified_records_df.empty:
    modified_rec_data = modified_records_df.T.to_dict().values()  # again, converting the DataFrame to a list of dicts
    modified_recs = []
    for m in modified_rec_data:
        # zab is a pseudocode placeholder for the updated value
        modified_rec = {'id': m['id'],
                        'zab': zab}
        modified_recs.append(modified_rec)
    # When a record is modified, only the RelatedRec object is updated.
    # The Record object already exists and stays unmodified.
    db_session.bulk_update_mappings(RelatedRec, modified_recs)

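The problem is that, for ~8k records, looping over the dictionaries takes about 20 seconds, while the actual database inserts/updates take only about 4 seconds. I am hoping there is a clever way to eliminate the for loops, since they appear to be the bottleneck. My database is PostgreSQL, and my driver is psycopg2 2.6.2.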
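
For illustration, one possible way to drop the per-row loops is to select the needed columns and let pandas build the list of dicts in a single call with `to_dict(orient='records')`. This is only a minimal sketch under assumptions: the column names (`foo_id`, `bar`, `baz`) and sample values are hypothetical, and the database step is faked so the snippet runs without a session. Whether it actually removes the 20-second cost would need to be measured on the real data.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions, not the real schema.
new_records_df = pd.DataFrame({
    "foo_id": [1, 2, 3],
    "bar": ["a", "b", "c"],
    "baz": [10.0, 20.0, 30.0],
})

# Select only the columns the Record table needs and convert them in one call,
# instead of transposing the frame and looping over each row dict.
new_recs = new_records_df[["foo_id", "bar"]].to_dict(orient="records")

# In the real code, bulk_insert_mappings(Record, new_recs, return_defaults=True)
# would add an 'id' key to each dict. Here that step is faked with made-up ids.
for fake_id, rec in enumerate(new_recs, start=100):
    rec["id"] = fake_id

# Build the related rows column-wise, then convert once more.
related_df = pd.DataFrame({
    "rec_id": [rec["id"] for rec in new_recs],
    "baz": new_records_df["baz"].to_numpy(),
})
new_related_recs = related_df.to_dict(orient="records")
```

The same pattern would apply to modified_records_df: pick or rename the `id` and updated-value columns, then call `to_dict(orient='records')` and pass the result to bulk_update_mappings.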

0 answers:
