Question

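I'm having trouble with this problem. I'm trying to compare two tables from two different databases to see which tuples have been added, which have been deleted, and which have been updated. I do this with the following code: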

from sqlalchemy import *

# query the databases to get all tuples from the relations
# save each relation to a list in order to be able to iterate over their tuples multiple times
# iterate through the lists, hash each tuple with k, v being primary key, tuple
# iterate through the "after" relation. for each tuple in the new relation, hash its key in the "before" relation. 
# If it's found and the tuple is different, consider that an update, else, do nothing.
# If it is not found, consider that an insert
# iterate through the "before" relation. for each tuple in the "before" relation, hash by the primary key
# if the tuple is found in the "after" relation, do nothing
# if not, consider that a delete.

dev_engine = create_engine('mysql://...')
prod_engine = create_engine('mysql://...')

def transactions(exchange):
    dev_connect = dev_engine.connect()
    prod_connect = prod_engine.connect()

    # Run each query on the connection opened for that database
    get_dev_instrument = "select * from " + exchange + "_instrument;"
    instruments = dev_connect.execute(get_dev_instrument)
    instruments_list = list(instruments)
    print('made instruments_list')

    get_prod_instrument = "select * from " + exchange + "_instrument;"
    instruments_after = prod_connect.execute(get_prod_instrument)
    instruments_after_list = list(instruments_after)
    print('made instruments after_list')


    before_map = {}
    after_map = {}

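    # Build the maps from the materialized lists; the result objects
    # were already consumed when the lists were created.
    for row in instruments_list:
        before_map[row['instrument_id']] = row
    for y in instruments_after_list:
        after_map[y['instrument_id']] = y
    print('formed maps')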
    update_count = insert_count = delete_count = 0

    change_list = []
    for prod_row in instruments_after_list:
        result = list(prod_row)
        try:
            row = before_map[prod_row['instrument_id']]
            if not row == prod_row:
                update_count += 1
                for i in range(len(row)):
                    if not row[i] == prod_row[i]:
                        result[i] = str(row[i]) + '--->' + str(prod_row[i])
                result.append("updated")
                change_list.append(result)
        except KeyError:
            insert_count += 1
            result.append("inserted")
            change_list.append(result)

    for before_row in instruments_list:
        # Copy the row into a list so that a status label can be appended
        result = list(before_row)
        if before_row['instrument_id'] not in after_map:
            delete_count += 1
            result.append("deleted")
            change_list.append(result)

    for el in change_list:
        print(el)

    print("Insert: " + str(insert_count))
    print("Update: " + str(update_count))
    print("Delete: " + str(delete_count))

    dev_connect.close()
    prod_connect.close()

def main():

    transactions("...")

main()

instruments is the "before" table and instruments_after is the "after" table, so I want to see the changes that occurred in going from instruments to instruments_after.

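The code above works well, but it fails when instruments or instruments_after is very large. I have a table with more than 4 million rows, and just trying to load it into memory causes Python to exit. I tried to get around this by using LIMIT, OFFSET in my queries to append to instruments_list in chunks, but Python still exits because two lists of this size simply take up too much space. My last option is to select a batch from one relation, iterate over the batches of the second relation, and compare them, but that is very error-prone. Is there another way to get around this problem? I have considered allocating more memory to my VM, but I feel that the space complexity of my code is the real problem, and that should be fixed first.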
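One way to keep memory bounded, offered here only as a rough sketch and not as a tested solution for this setup, is to have both queries return rows ordered by the primary key and then walk the two result streams in lockstep, like the merge step of a merge sort. Only one row from each side is held at a time. The function name diff_sorted and the key argument are hypothetical, and the sketch assumes that both inputs are sorted by the same key and that Python compares keys in the same order as the database's ORDER BY (true for integer keys, not necessarily for strings with a database collation).

```python
def diff_sorted(before, after, key):
    """Yield (kind, before_row, after_row) for two iterables sorted by key(row).

    kind is "deleted", "inserted" or "updated". Hypothetical helper:
    both inputs are assumed to be sorted ascending by the same key.
    """
    before_it, after_it = iter(before), iter(after)
    done = object()  # sentinel marking an exhausted stream
    b = next(before_it, done)
    a = next(after_it, done)
    while b is not done or a is not done:
        if a is done or (b is not done and key(b) < key(a)):
            # Key exists only on the "before" side
            yield ("deleted", b, None)
            b = next(before_it, done)
        elif b is done or key(a) < key(b):
            # Key exists only on the "after" side
            yield ("inserted", None, a)
            a = next(after_it, done)
        else:
            # Same key on both sides: report only if the contents differ
            if tuple(b) != tuple(a):
                yield ("updated", b, a)
            b = next(before_it, done)
            a = next(after_it, done)


# Small in-memory stand-ins for the two ordered result sets
before_rows = [(1, "a"), (2, "b"), (4, "d")]
after_rows = [(2, "B"), (3, "c"), (4, "d")]
changes = list(diff_sorted(before_rows, after_rows, key=lambda r: r[0]))
for change in changes:
    print(change)
```

With real tables, the two iterables would be the results of queries ending in "order by instrument_id". Whether the driver actually streams rows instead of buffering them depends on the setup; SQLAlchemy's stream_results execution option is one thing to look at.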

Using SQLAlchemy

0 answers: