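UnicodeDecodeError in Python

Time: 2015-09-17 06:09:51

Tags: python

I have a text file that is larger than 200 MB. I want to read it and then pick the 30 most frequently used words. When I run my script, it raises an error. The code is as follows: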

    import re
    import sys, string
    import codecs
    from collections import Counter
    import collections
    import unicodedata

    with open('E:\\Book\\1800.txt', "r", encoding='utf-8') as File_1800:
        for line in File_1800:
            sepFile_1800 = line.lower()
            words_1800 = re.findall(r'\w+', sepFile_1800)
    for wrd_1800 in [words_1800]:
        long_1800 = [w for w in wrd_1800 if len(w) > 3]
        common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)


    Traceback (most recent call last):
      File "C:\Python34\CommonWords.py", line 14, in <module>
        for line in File_1800:
      File "C:\Python34\lib\codecs.py", line 313, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3784: invalid start byte

2 Answers:

Answer 0 (score: 1)

The file does not contain UTF-8 encoded data. Find the correct encoding and update the line: with open('E:\\Book\\1800.txt', "r", encoding='correct_encoding')
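If you do not know the encoding, one rough way to narrow it down is to try a few candidates in order. This is only a sketch: guess_encoding and the candidate list are hypothetical, not part of any library, and a decode that succeeds does not prove the encoding is the right one (latin-1 in particular accepts every byte, so it is kept last).

```python
def guess_encoding(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `data` without error.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next
        return name
    return None


# 0xa3 is a UTF-8 continuation byte, so it cannot start a character in UTF-8.
sample = b"price: \xa3 5"
print(guess_encoding(sample))
```

For the real file you would open it in binary mode ("rb") and pass the bytes you read. If you pass only a chunk, the chunk may end in the middle of a multi-byte character, which can make a correct encoding look wrong.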

Answer 1 (score: 0)

Try using encoding='latin1' instead of utf-8.
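To see why this makes the error go away, here is a small sketch on made-up bytes (the raw value is an assumption, not taken from your file). latin1 maps every byte to a character, so it never raises, but the text is only correct if the file really was written in that encoding. Passing errors="replace" is another option if you would rather keep utf-8 and mark the bad bytes.

```python
raw = b"cost \xa3 10"  # hypothetical bytes containing the 0xa3 from the traceback

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # 0xa3 is not a valid start byte in UTF-8

as_latin1 = raw.decode("latin1")                      # 0xa3 becomes the pound sign
as_replaced = raw.decode("utf-8", errors="replace")   # 0xa3 becomes U+FFFD

print(utf8_ok)
print(as_latin1)
print(as_replaced)
```

The same errors argument is accepted by open(), so open(path, encoding='utf-8', errors='replace') reads the file without raising.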

Also, in these lines:

    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800 = re.findall(r'\w+', sepFile_1800)
    for wrd_1800 in [words_1800]:
        ...

the script reassigns the words_1800 variable to the re.findall matches on every line. So by the time you reach for wrd_1800 in [words_1800], words_1800 contains only the matches from the last line.
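A tiny reproduction of the problem, using a made-up two-line input instead of the file:

```python
import re

lines = ["alpha beta", "gamma delta"]  # stand-in for the lines of the file

for line in lines:
    # Each pass overwrites the previous result instead of adding to it.
    words = re.findall(r"\w+", line.lower())

print(words)  # ['gamma', 'delta']
```

The words from the first line are gone by the time the loop finishes.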

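If you want to make the smallest possible change, initialize an empty list before iterating over the file:

    words_1800 = []

Then add each line's matches to the list instead of replacing the list:

    words_1800.extend(re.findall(r'\w+', sepFile_1800))

Then you can do this (without the second for loop):

    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(30))
    print(common_words_1800)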
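Since the file is over 200 MB, you may prefer not to build one big list of words at all. Here is a sketch that updates a Counter line by line instead. The function name most_common_long_words and its parameters are made up for illustration, and the sample lines stand in for the file.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def most_common_long_words(lines, n=30, min_len=4):
    """Count words of at least `min_len` characters across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Update the running counts; only the Counter is kept in memory.
        counts.update(w for w in WORD_RE.findall(line.lower()) if len(w) >= min_len)
    return counts.most_common(n)


sample_lines = ["The quick brown fox", "the QUICK brown dog", "quick thinking"]
print(most_common_long_words(sample_lines, n=2))

# With the real file (the encoding here is an assumption, see above):
# with open('E:\\Book\\1800.txt', encoding='latin1') as File_1800:
#     print(dict(most_common_long_words(File_1800)))
```

A file object is already an iterable of lines, so it can be passed straight to the function.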