Question

I want to convert a CSV file from ASCII encoding to UTF-8. This is the code I tried:

import codecs
import chardet
BLOCKSIZE = 9048576 # or some other, desired size in bytes

with codecs.open("MFile2016-05-22.csv", "r", "ascii") as sourceFile:
    with codecs.open("tmp.csv", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)


# Read the result back as raw bytes; chardet.detect expects bytes
with open("tmp.csv", "rb") as f:
    content = f.read()

encoding = chardet.detect(content)['encoding']
print(encoding)

After running this, I still get "ascii" as the detected encoding. The encoding did not change. What am I missing?

Answer 1

ASCII is a subset of UTF-8; every ASCII file is automatically a UTF-8 file as well. You don't need to do anything.

Answer 2

ASCII is a subset of UTF-8. Any ASCII-encoded file is also valid UTF-8.

From the Wikipedia article on UTF-8:

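The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.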

In other words, your operation is a no-op; nothing should change.
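A quick way to see this, as a minimal sketch in Python 3 (the sample text is made up for illustration): encoding the same ASCII-only string as ASCII and as UTF-8 yields identical bytes, so rewriting the file changes nothing on disk.

```python
# Illustrative sample rows; any text limited to code points 0-127 behaves the same.
text = "id,name,price\n1,widget,9.99\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Identical byte sequences: re-encoding ASCII text as UTF-8 is a no-op.
same = ascii_bytes == utf8_bytes
print(same)
```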

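Any tool that detects codecs (such as chardet) will rightly still mark the file as ASCII. Marking it as UTF-8 would also be valid, but so would marking it as ISO-8859-1 (Latin-1), CP-1252 (the Windows Latin-1-based code page), or any number of other codecs that are supersets of ASCII. That would be confusing, however, since your data consists only of ASCII code points. Tools that accept only ASCII will accept your CSV file, whereas they would not accept UTF-8 data that contains more than just ASCII code points.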
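To illustrate why the label is ambiguous for ASCII-only data, here is a small sketch (Python 3, with made-up sample bytes). It decodes the same bytes with several ASCII-superset codecs and compares the results, then shows that UTF-8 data containing a non-ASCII character is rejected by a strict ASCII decoder.

```python
ascii_only = b"id,name\n1,widget\n"

# Every ASCII-superset codec decodes ASCII-only bytes to the same text.
decoded = {codec: ascii_only.decode(codec)
           for codec in ("ascii", "utf-8", "iso-8859-1", "cp1252")}
all_equal = len(set(decoded.values())) == 1

# UTF-8 data with a non-ASCII character (the e-acute becomes two bytes)
# is not acceptable to a strict ASCII decoder.
non_ascii = "1,caf\u00e9\n".encode("utf-8")
try:
    non_ascii.decode("ascii")
    ascii_accepts = True
except UnicodeDecodeError:
    ascii_accepts = False

print(all_equal, ascii_accepts)
```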

If the goal is to use chardet to validate that any piece of text is valid UTF-8, you'll have to accept ASCII too:

import chardet

def is_utf8(content):
    # chardet reports pure-ASCII input as 'ascii', which is also valid UTF-8
    encoding = chardet.detect(content)['encoding']
    return encoding in {'utf-8', 'ascii'}
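Since chardet only guesses, one alternative when the goal is strictly validation is to attempt a strict UTF-8 decode. This is not part of the answer above, just a sketch that assumes Python 3 and that the content is available as bytes; the helper name `is_valid_utf8` is hypothetical. ASCII-only input passes automatically, for the reason given earlier.

```python
def is_valid_utf8(content: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 (ASCII-only input always does)."""
    try:
        content.decode("utf-8")  # error handling is strict by default
        return True
    except UnicodeDecodeError:
        return False
```

For example, `is_valid_utf8(b"abc")` succeeds, while a lone Latin-1 byte such as `b"caf\xe9"` fails because `0xE9` starts a multi-byte sequence that is never completed.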

Encoding a file from ASCII to UTF8
