" io.open"是不正确地打开UTF-16文件

Asked: 2014-10-17 20:06:36

Tags: python file python-2.7 encoding io

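io.open is supposed to strip the preamble (the byte order mark, or BOM) when opening files in various encodings.

For example, the following UTF-8-SIG encoded file has its preamble correctly stripped before it is read into a string:

(Note: I am not opening these files in binary mode for the actual read. The first line of each log below only demonstrates the raw contents of the file that is about to be read.)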

# Raw binary, so you can see that it's a proper UTF-8-SIG encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xef\xbb\xbf"EventId","Rate","Attribute1","Attribute2","(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89"\r\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-8-SIG').readline()
u'"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

However, while this UTF-16LE encoded file opens successfully, the preamble comes along with the text:

# Raw binary, so you can see that it's a proper UTF-16LE encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xff\xfe"\x00E\x00v\x00e\x00n\x00t\x00I\x00d\x00"\x00,\x00"\x00R\x00a\x00t\x00e\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x001\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x002\x00"\x00,\x00"\x00(\x00a\xffe\xff\xc9\x03e\xffa\xff)\x00\x89\xff"\x00\r\x00\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-16LE').readline()
u'\ufeff"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

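This then breaks file validation, which expects the file contents to start immediately with "EventId"...

Am I opening this file incorrectly?

Please note that I am not satisfied with manually removing the preamble after opening the file. I want to support arbitrary encodings, and I would expect io.open (given the correct encoding, as determined by chardet) to take care of the preamble, rather than me having to strip one of a bunch of hard-coded preambles whenever it shows up at the start of the first line.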

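1 answer:

Answer 0: (score: 2)

According to this answer, you need to use UTF-16, not UTF-16LE. The UTF-16 codec reads the BOM to determine the byte order and then discards it, whereas UTF-16LE and UTF-16BE assume a fixed byte order and decode a leading BOM as an ordinary U+FEFF character.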

io.open(csv_file_path, encoding='UTF-16').readline()
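
To see the difference side by side, here is a minimal sketch that writes a small UTF-16LE file with a BOM to a temporary path and reads it back with both codecs. The helper bom_aware_encoding is hypothetical (it is not part of io or chardet); it only illustrates one way to swap a detected encoding for its BOM-stripping counterpart when the file really starts with a BOM. The sample header and the temporary file are assumptions made for illustration.

```python
import codecs
import io
import os
import tempfile

# BOM -> codec that consumes that BOM while decoding.
# UTF-32 entries come first because the UTF-32-LE BOM starts with the
# UTF-16-LE BOM. (Caveat: a UTF-16-LE file whose first character is
# U+0000 would be misread as UTF-32-LE by this simple check.)
_BOM_CODECS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def bom_aware_encoding(path, detected):
    """Hypothetical helper: return a BOM-stripping codec if the file
    starts with a known BOM, otherwise return `detected` unchanged."""
    with io.open(path, "rb") as f:
        head = f.read(4)
    for bom, codec in _BOM_CODECS:
        if head.startswith(bom):
            return codec
    return detected


def first_line(path, encoding):
    """Read the first line of a text file using the given encoding."""
    with io.open(path, encoding=encoding) as f:
        return f.readline()


# Build a sample UTF-16LE file that begins with a BOM.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with io.open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + u'"EventId","Rate"\r\n'.encode("utf-16-le"))

with_bom = first_line(path, "utf-16-le")   # BOM survives as u'\ufeff'
without_bom = first_line(path, "utf-16")   # BOM is consumed by the codec
fixed = first_line(path, bom_aware_encoding(path, "utf-16-le"))
os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
print(repr(fixed))
```

The helper only switches codecs when a BOM is actually present, because decoding a BOM-less file with plain UTF-16 falls back to the platform's native byte order, which could silently misread big-endian data.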