How do I split a UTF-8 file into separate rows (comma-delimited) using Python?

Time: 2014-08-26 22:03:32

Tags: python csv utf-8

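I'm trying to convert a UTF-16 file to a UTF-8 file (because I'm using the Python csv module, which apparently doesn't handle UTF-16 files). I then want to split this UTF-8 file on its delimiter so that I can import it into a Postgres table using a simple row.strip() approach. The Python file looks like this: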

import codecs
import csv

# Re-encode the UTF-16 source as a UTF-8 copy.
with codecs.open(sourcefile, 'rU', 'UTF-16') as infile:
    with open(sourcefile + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))

# Parse the UTF-8 copy as comma-delimited data.
with open(sourcefile + '.utf8', 'rb') as f:
    reader = csv.reader(f, delimiter=',')

    for row in reader:
        print row[1]

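I can't split the rows: each row seems to have only one element, and print row[1] gives me an "index out of range" error. How do I split this file?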

Excel rows:

15,"1/2 TYPE A","98","MCDS, TX","XA","852","TX","955","148","HAPPY, TX",,"$0.00","0","0.00","$1,504","179","0.00%","100.00%","0"
32,"1/2 TYPE B","98","MCDS, MI","XA","252","MI","72","925","HAPPY, MI",,"$0.00","0","0.00","$2,504","225","0.00%","100.00%","0"

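I apologize for not being descriptive enough. Basically, the input file is a UTF-16 file. Previously I would open the file in Excel, split the single column into multiple columns using ',' as the delimiter, and save the result as a CSV file. I then ran that CSV file through a Python script, which read it, stripped the fields of each row, and imported the data into a Postgres database.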

The original import portion of the Python script (used after I had split the columns on ',' in Excel) looked like this (simplified):

for row in reader:
    arg = {
        'item_number': row[0].strip(),
        'item_size': row[1].strip(),
        'description': row[2].strip(),
        # etc...
    }
    cur.execute(
        """INSERT INTO
        "Sales"("ITEM_NUMBER", "ITEM_SIZE", "DESCRIPTION")
        select
            %(item_number)s,
            %(item_size)s,
            %(description)s;""", arg)

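However, I would now like to run the UTF-16 file directly through my Python script to import the data into Postgres, so that I no longer have to open the file in Excel. My plan is to convert the file to UTF-8, then somehow split each row and import it into my database.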

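I have been able to convert the file to UTF-8 successfully, but the problem now is that the UTF-8 file is essentially a bunch of rows that are each treated as "one column". How do I go about splitting each row? I can't just do a simple row[0].strip(), because some of the description fields in the file contain commas.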

1 Answer:

Answer 0: (score: 0)

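Don't create an intermediate file; just use the transformation described in the docs (search for unicode_csv_reader). For convenience, I have converted the generators into generator expressions: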

import codecs
import csv

sourcefile = 'csv16.csv'
with codecs.open(sourcefile, 'rU', 'UTF-16') as infile:
    reader = csv.reader((line.encode('utf-8')
                         for line in infile),
                        delimiter=',')
    for row in ([item.decode('utf-8')
                 for item in row]
                for row in reader):
        print u'/'.join(row)

I tested the code above against the following file, saved as big-endian UTF-16:

1,2,3,4
5,6,7,8
"98°","①", "®©§™"

Output:

1/2/3/4
5/6/7/8
98°/①/ "®©§™"
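
The code in this answer targets Python 2. For readers on Python 3 (an assumption; the question does not say which version is in use), the encode/decode round trip should not be needed, because the built-in open() decodes the file and csv.reader accepts text directly. The following is a minimal sketch under that assumption, not a drop-in replacement. The helper names read_utf16_csv and to_args and the temporary file are hypothetical; the sample data is copied from the rows in the question.

```python
import csv
import os
import tempfile

# Sample rows copied from the question; several quoted fields contain commas.
SAMPLE = (
    '15,"1/2 TYPE A","98","MCDS, TX","XA","852","TX","955","148","HAPPY, TX",,'
    '"$0.00","0","0.00","$1,504","179","0.00%","100.00%","0"\n'
    '32,"1/2 TYPE B","98","MCDS, MI","XA","252","MI","72","925","HAPPY, MI",,'
    '"$0.00","0","0.00","$2,504","225","0.00%","100.00%","0"\n'
)


def read_utf16_csv(path):
    """Return the rows of a comma-delimited UTF-16 file as lists of str."""
    # open() decodes UTF-16 (using the BOM to pick the byte order), so
    # csv.reader receives text; newline='' is what the csv docs recommend.
    with open(path, encoding='utf-16', newline='') as infile:
        return list(csv.reader(infile, delimiter=','))


def to_args(row):
    """Build the parameter dict used by the INSERT in the question (hypothetical helper)."""
    return {
        'item_number': row[0].strip(),
        'item_size': row[1].strip(),
        'description': row[2].strip(),
    }


# Write the sample to a temporary UTF-16 file standing in for the real source file.
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(path, 'w', encoding='utf-16', newline='') as outfile:
    outfile.write(SAMPLE)

rows = read_utf16_csv(path)
os.remove(path)

for row in rows:
    # Quoted fields such as "MCDS, TX" stay in one piece.
    print(len(row), row[3], to_args(row))
```

Each sample row parses into 19 fields, and the commas inside quoted fields do not split them. In the test output above, the last field keeps its quotes and leading space because the quote follows a space after the comma; passing skipinitialspace=True to csv.reader should make the reader skip that space and treat the quotes as quoting.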