Question

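Below is the code I need help with. I have to run it over 1,300,000 rows, and it currently takes up to 40 minutes to insert ~300,000 rows.

I figure bulk insert is the route to go to speed it up? Or is it slow because I'm iterating over the rows via the for data in reader: portion?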

#Opens the prepped csv file
with open (os.path.join(newpath,outfile), 'r') as f:
    #hooks csv reader to file
    reader = csv.reader(f)
    #pulls out the columns (which match the SQL table)
    columns = next(reader)
    #trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    #starts SQL statement
    query = 'bulk insert into SpikeData123({0}) values ({1})'
    #puts column names in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    print 'Query is: %s' % query
    #starts cursor from cnxn (which works)
    cursor = cnxn.cursor()
    #uploads everything by row
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()

I am picking my column headers dynamically on purpose (as I would like to write the most pythonic code possible).

SpikeData123 is the table name.

Answer 1

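As noted in a comment on another answer, the T-SQL BULK INSERT command only works if the file to be imported is on the same machine as the SQL Server instance, or is in an SMB/CIFS network location that the SQL Server instance can read. It may therefore not be applicable when the source file is on a remote client.

pyodbc 4.0.19 added a Cursor#fast_executemany feature which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...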

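import time

import pyodbc

# conn_str is assumed to be defined elsewhere (an ODBC connection string)
cnxn = pyodbc.connect(conn_str, autocommit=True)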
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...

cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

crsr.fast_executemany = True  # new in pyodbc 4.0.19

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... reduced the execution time to just over 1 second.
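Applied to the CSV scenario in the question, the same idea could look roughly like the sketch below. It is not part of the original answer: the helper names build_insert and load_csv and the batch_size default are hypothetical, and the cursor argument is assumed to be a pyodbc cursor (4.0.19 or later). The table and column names are interpolated into the SQL text, so they are assumed to come from a trusted source.

```python
import csv
import io
from itertools import islice


def build_insert(table, columns):
    """Build a parameterized INSERT statement for the given columns."""
    placeholders = ",".join("?" * len(columns))
    return "INSERT INTO {0} ({1}) VALUES ({2})".format(
        table, ",".join(columns), placeholders)


def load_csv(cursor, table, f, batch_size=10000):
    """Insert rows from an open CSV file in batches using executemany.

    `cursor` is assumed to be a pyodbc cursor; the header row of the
    file is assumed to match the column names of `table`.
    """
    reader = csv.reader(f)
    columns = [c.strip() for c in next(reader)]
    sql = build_insert(table, columns)
    cursor.fast_executemany = True  # available in pyodbc >= 4.0.19
    total = 0
    while True:
        # Pull at most batch_size rows so the whole file is never in memory
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        cursor.executemany(sql, batch)
        total += len(batch)
    return total
```

Unlike the loop in the question, this sends one executemany call per batch instead of one execute and one commit per row. If the connection is not opened with autocommit=True, a single commit after load_csv returns should be enough.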

Answer 2

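Update: As noted in a comment from @SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).

BULK INSERT will almost certainly be much faster than reading the source file row by row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files: they cannot handle text qualifiers (ref: here). That is, if your CSV file does not contain qualified text strings ...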

1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07

... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contain commas) ...

1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07

... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...

1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07

... or a tab-delimited file (where → represents the tab character) ...

1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07

... and then BULK INSERT that file. For the latter (tab-delimited) file, the BULK INSERT code would look something like this:

import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
    FIELDTERMINATOR='\\t',
    ROWTERMINATOR='\\n'
    );
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()

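Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can read the source file directly. For cases where the source file is on a remote client, see this answer.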
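The pre-processing step described above (quoted CSV to tab-delimited) might be sketched as follows. This is an illustration rather than code from the answer: csv_to_tsv is a hypothetical helper, and it simply refuses any field that itself contains a tab or a line break, since such a value would corrupt the delimited output.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy rows from a quoted CSV stream to a tab-delimited stream.

    Returns the number of rows written. Raises ValueError if a field
    contains a tab or line break, because the plain tab-delimited
    output has no way to represent it.
    """
    count = 0
    for row in csv.reader(src):
        for field in row:
            if "\t" in field or "\n" in field or "\r" in field:
                raise ValueError("field contains a tab or newline: %r" % field)
        dst.write("\t".join(row) + "\n")
        count += 1
    return count
```

With real files, both would typically be opened with newline='' (as the csv module documentation recommends) so that the row terminator written here matches the ROWTERMINATOR given to BULK INSERT.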

Answer 3

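Yes, bulk insert is the right path for loading large files into a database. At a glance, I would say the reason it takes so long is, as you mentioned, that you are looping over each row of data from the file. That effectively removes the benefit of a bulk insert and turns it into a normal insert. Remember that, as its name implies, it is meant for inserting chunks of data. I would remove the loop and try again.

Also, I'd double-check your syntax for the bulk insert, as it doesn't look correct to me. Check the SQL that pyodbc generates, as I have a feeling it might only be executing a normal insert.

Alternatively, if it is still slow, I would try using bulk insert directly from SQL: either load the whole file into a temp table with bulk insert and then insert the relevant columns into the right tables, or use a mix of bulk insert and bcp to insert the specific columns, or use OPENROWSET.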
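The staging-table variant could be sketched as below. This is only an outline under stated assumptions: staging_load_statements is a hypothetical helper, the staging table is assumed to exist already with columns matching the file layout, the file is assumed to be tab-delimited with a header row, and all names are assumed to be trusted (they are interpolated into the SQL text).

```python
def staging_load_statements(target, staging, path, columns):
    """Return the T-SQL statements for a staged bulk load, in order.

    1. BULK INSERT the whole file into the staging table.
    2. Copy only the relevant columns into the target table.
    """
    col_list = ", ".join(columns)
    escaped_path = path.replace("'", "''")  # escape quotes in the literal
    bulk = (
        "BULK INSERT {0} FROM '{1}' WITH ("
        "FIELDTERMINATOR='\\t', ROWTERMINATOR='\\n', FIRSTROW=2);"
    ).format(staging, escaped_path)
    copy = "INSERT INTO {0} ({1}) SELECT {1} FROM {2};".format(
        target, col_list, staging)
    return [bulk, copy]
```

Each statement would then be passed to cursor.execute in turn. As with any BULK INSERT, the path must be readable by the SQL Server instance itself.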

Answer 4

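This problem was frustrating for me, and I didn't see much improvement from fast_executemany until I found this other post on SO, specifically Bryan Baiilliache's comment regarding max varchar. I had been using SQLAlchemy, and even ensuring better datatype parameters did not fix the issue for me; switching to pyodbc did, however. I also took Michael Moura's advice of using a temp table and found it shaved off even more time. I wrote a function in case anyone finds it useful. It takes either a list or a list of lists for the insert. Inserting the same data with SQLAlchemy and Pandas to_sql sometimes took upwards of 40 minutes; with this function it takes just under 4 seconds. I may have been misusing my former method, though.

Connection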

import os
import time

import pyodbc

def mssql_conn():
    conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                          server=os.environ.get('MS_SQL_SERVER'),
                          database='EHT',
                          uid=os.environ.get('MS_SQL_UN'),
                          pwd=os.environ.get('MS_SQL_PW'),
                          autocommit=True)
    return conn

Insert function

def mssql_insert(table,val_lst,truncate=False,temp_table=False):
    '''Use as direct connection to database to insert data, especially for
       large inserts. Takes either a single list (for one row),
       or list of list (for multiple rows). Can either append to table
       (default) or if truncate=True, replace existing.'''
    conn = mssql_conn()
    cursor = conn.cursor()
    cursor.fast_executemany = True
    tt = False
    qm = '?,'
    if isinstance(val_lst[0],list):
        rows = len(val_lst)
        params = qm * len(val_lst[0])
    else:
        rows = 1
        params = qm * len(val_lst)
        val_lst = [val_lst]
    params = params[:-1]
    if truncate:
        cursor.execute(f"TRUNCATE TABLE {table}")
    if temp_table:
        #create a temp table with same schema
        start_time = time.time()
        cursor.execute(f"SELECT * INTO ##{table} FROM {table} WHERE 1=0")
        table = f"##{table}"
        #set flag to indicate temp table was used
        tt = True
    else:
        start_time = time.time()
    #insert into either existing table or newly created temp table
    stmt = f"INSERT INTO {table} VALUES ({params})"
    cursor.executemany(stmt,val_lst)
    if tt:
        #remove temp moniker and insert from temp table
        dest_table = table[2:]
        cursor.execute(f"INSERT INTO {dest_table} SELECT * FROM {table}")
        print('Temp table used!')
        print(f'{rows} rows inserted into the {dest_table} table in '
              f'{time.time() - start_time} seconds')
    else:
        print('No temp table used!')
        print(f'{rows} rows inserted into the {table} table in '
              f'{time.time() - start_time} seconds')
    cursor.close()
    conn.close()

My console results, first without a temp table and then with one (in both cases the table contained data at the time of execution and truncate=True):

No temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 10.595500707626343 seconds

Temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 3.810380458831787 seconds

Answer 5

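FWIW, I tried a few methods of inserting into SQL Server and did some testing of my own. I was actually able to get the fastest results by using SQL Server batches and pyodbc Cursor.execute statements. I did not test saving to CSV and using BULK INSERT; I wonder how it compares.

Here's my blog post on the testing: http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html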
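One possible reading of "batches" here is a multi-row INSERT sent with a single Cursor.execute call. The sketch below shows that interpretation only; it is not taken from the blog post, and multirow_insert_batches and its defaults are hypothetical. The defaults stay within two SQL Server limits: at most 1000 rows in one VALUES list and at most 2100 parameters per statement.

```python
def multirow_insert_batches(table, columns, rows,
                            max_params=2000, max_rows=1000):
    """Yield (sql, params) pairs, each inserting several rows at once.

    `rows` is a list of equal-length sequences. Each yielded pair can be
    passed to cursor.execute(sql, params).
    """
    ncols = len(columns)
    # Rows per statement, limited by both the row cap and the parameter cap
    per_batch = max(1, min(max_rows, max_params // ncols))
    row_placeholder = "(" + ",".join("?" * ncols) + ")"
    for start in range(0, len(rows), per_batch):
        chunk = rows[start:start + per_batch]
        sql = "INSERT INTO {0} ({1}) VALUES {2}".format(
            table, ",".join(columns),
            ",".join([row_placeholder] * len(chunk)))
        # Flatten the chunk into one flat parameter list
        params = [value for row in chunk for value in row]
        yield sql, params
```

A caller would loop over the generator and run cursor.execute(sql, params) for each pair, committing once at the end.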

How to speed up bulk insert to MS SQL Server from CSV using pyodbc