Question

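I have a text file (英语.txt) that I believe is Latin-1, but it may contain fragments in other encodings. Is there a library or tool for locating those fragments?

I know about things like the Python chardet library, but I am specifically looking for a tool that tests a Latin-1 file and detects anomalies. Even a general-purpose detection library would be fine, as long as it can tell me that it found a non-Latin-1 pattern and give me the index.

Command-line tools and Python libraries are especially welcome.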

Answer 1

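Latin-1 (or do you mean its ISO-8859-15 variant, also known as Latin-9, which has the euro sign?) is not easy to detect.

The simple approach is to check whether any of the unused byte values actually occur (see the table here); if they do, something is wrong. To detect subtler violations, though, you would need to check whether the language of the text is one of those written in Latin-1. Otherwise there is no way to tell one 8-bit encoding from another. It is best not to mix 8-bit encodings in the first place, at least not without marking the change of encoding in some way...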
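
As a rough illustration of the "unused characters" check, here is a minimal Python 3 sketch. It assumes that "unused" means the byte range 0x80-0x9F, which ISO-8859-1 assigns to C1 control codes that almost never occur in real text (in CP1252 most of these bytes are printable characters instead). The helper name `find_c1_bytes` and the sample bytes are made up for the example.

```python
import re

# Assumption: "unused" means bytes 0x80-0x9F, which are C1 control codes
# in ISO-8859-1 and rarely appear in genuine Latin-1 text.
C1_BYTES = re.compile(rb'[\x80-\x9f]')

def find_c1_bytes(data):
    """Return (offset, byte value) for every byte in the range 0x80-0x9F."""
    return [(m.start(), data[m.start()]) for m in C1_BYTES.finditer(data)]

# Hypothetical sample: Latin-1 text with CP1252 curly quotes mixed in.
sample = b'caf\xe9 \x93quoted\x94 na\xefve'
for offset, value in find_c1_bytes(sample):
    print("offset %d: byte 0x%02x" % (offset, value))
```

A hit only shows that the byte is suspicious for Latin-1; it does not say which encoding the fragment really uses.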

Answer 2

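What are your grounds for believing that the file (1) is Latin-1 and (2) may contain fragments in another encoding? How big is the file? What is a "general-purpose detection library"? Have you considered that it might be a Windows encoding such as CP1252?

Some rough diagnostics: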

# preliminaries (Python 3)
import re

with open('the_file.txt', 'rb') as f:
    text = f.read()
print(len(text), "bytes in file")

# How many non-ASCII bytes? (iterating over a bytes object yields ints)
print(sum(1 for c in text if c > 0x7f), "non-ASCII bytes")

# Will it decode as UTF-8 OK?
try:
    junk = text.decode('utf8')
    print("utf8 decode OK")
except UnicodeDecodeError as e:
    print(e)

# Runs of more than one non-ASCII byte are somewhat rare in single-byte encodings
# of languages written in a Latin script ...
runs = re.findall(rb'[\x80-\xff]+', text)
nruns = len(runs)
print(nruns, "runs of non-ASCII bytes")
if nruns:
    avg_rlen = sum(len(run) for run in runs) / nruns
    print("average run length: %.2f bytes" % avg_rlen)
# then if indicated you could write some code to display runs in context ...
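
The last comment above could be fleshed out roughly as follows. This is only a sketch in Python 3, not a definitive detector: the function name `runs_in_context`, the `width` parameter and the sample bytes are invented for illustration. It reports the byte offset of each non-ASCII run, shows a little surrounding context, and flags runs that happen to be valid UTF-8.

```python
import re

NON_ASCII_RUN = re.compile(rb'[\x80-\xff]+')

def runs_in_context(data, width=10):
    """Yield (offset, run, context, looks_like_utf8) for each non-ASCII run."""
    for m in NON_ASCII_RUN.finditer(data):
        run = m.group()
        try:
            run.decode('utf-8')
            looks_like_utf8 = True
        except UnicodeDecodeError:
            looks_like_utf8 = False
        # A few bytes on either side of the run, for eyeballing.
        context = data[max(0, m.start() - width):m.end() + width]
        yield m.start(), run, context, looks_like_utf8

# Hypothetical sample: a UTF-8 encoded e-acute inside otherwise Latin-1 text.
sample = b'na\xefve caf\xc3\xa9 r\xe9sum\xe9'
for offset, run, context, utf8 in runs_in_context(sample):
    flag = "valid UTF-8" if utf8 else ""
    print(offset, run, repr(context.decode('latin-1')), flag)
```

The UTF-8 flag is a heuristic only: the same two bytes are also legal Latin-1, so a flagged run still needs a human look at the context.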

Finding fragments of non-Latin-1 text in a mostly Latin-1 file?
