Question

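io.open is supposed to strip the preamble (the byte order mark, BOM) when it opens files in various encodings.

For example, the following file, encoded as UTF-8-SIG, correctly has its preamble stripped before it is read into a string:

(Note: I am not opening these files in binary mode. The first line of each log below only demonstrates the raw contents of the file that is about to be read.)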

# Raw binary, so you can see that it's a proper UTF-8-SIG encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xef\xbb\xbf"EventId","Rate","Attribute1","Attribute2","(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89"\r\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-8-SIG').readline()
u'"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

However, while this UTF-16LE-encoded file opens successfully, the preamble comes along with it:

# Raw binary, so you can see that it's a proper UTF-16LE encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xff\xfe"\x00E\x00v\x00e\x00n\x00t\x00I\x00d\x00"\x00,\x00"\x00R\x00a\x00t\x00e\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x001\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x002\x00"\x00,\x00"\x00(\x00a\xffe\xff\xc9\x03e\xffa\xff)\x00\x89\xff"\x00\r\x00\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-16LE').readline()
u'\ufeff"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

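This goes on to break file validation, which expects the file contents to start immediately with "EventId"...

Am I opening this file incorrectly?

Note that I am not happy with manually stripping the preamble after opening the file - I want to support arbitrary encodings, and I expect io.open (given the correct encoding, as determined by chardet) to handle this, rather than having to maintain a bunch of hardcoded preambles to strip whenever one shows up at the start of the first line.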

Answer 1

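According to this answer, you need to use UTF-16, not UTF-16LE. The UTF-16 codec reads the BOM to determine the byte order and then discards it, whereas UTF-16LE assumes the byte order up front and therefore decodes the BOM as an ordinary U+FEFF character.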

io.open(csv_file_path, encoding='UTF-16').readline()
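
A minimal sketch of the difference, assuming a throwaway file written just for the demonstration (the temporary path and the sample header text are made up here, not taken from the question):

```python
import io
import os
import tempfile

# Hypothetical sample data: a UTF-16LE file that starts with a BOM (bytes FF FE).
text = u'"EventId","Rate"\n'
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xff\xfe" + text.encode("utf-16-le"))

# 'utf-16-le' fixes the byte order up front, so the BOM is decoded as U+FEFF.
with io.open(path, encoding="utf-16-le") as f:
    explicit_le = f.readline()

# 'utf-16' uses the BOM to detect the byte order and then drops it.
with io.open(path, encoding="utf-16") as f:
    bom_aware = f.readline()

os.remove(path)

print(repr(explicit_le))
print(repr(bom_aware))
```

With the explicit little-endian codec the line should still begin with U+FEFF, while the BOM-aware codec should return it starting directly at the opening quote.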

"io.open" is opening UTF-16 files incorrectly
