Encoding error with Beautiful Soup: character maps to <undefined> (Python)

Time: 2018-05-29 06:12:24

Tags: python html encoding beautifulsoup

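I wrote a script that is supposed to retrieve HTML pages from a website and update their content. The following function looks for a certain file on my system, then tries to open and edit it: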

def update_sn(files_to_update, sn, table, title):
    paths = files_to_update['files']
    print('updating the sn')
    try:
        sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
        notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]

    except Exception:
        print('no sns were found')
        pass

    new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
    new_sn_number = sn

    htm_text = open(sn_htm, 'rb').read().decode('cp1252')
    content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S) 
    minus_content = htm_text.replace(content[0], '')
    table_soup = BeautifulSoup(table, 'html.parser')
    new_soup = BeautifulSoup(minus_content, 'html.parser')
    head_title = new_soup.title.string.replace_with(new_sn_number)
    new_soup.link.insert_after(table_soup.div.next)

    with open(new_path_name, "w+") as file:
        result = str(new_soup)
        try:
            file.write(result)
        except Exception:
            print('Met exception.  Changing encoding to cp1252')
            try:
                file.write(result('cp1252'))
            except Exception:
                print('cp1252 did\'nt work.  Changing encoding to utf-8')
                file.write(result.encode('utf8'))
                try:
                    print('utf8 did\'nt work.  Changing encoding to utf-16')
                    file.write(result.encode('utf16'))
                except Exception:
                    pass

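This works in most cases, but sometimes the write fails. The exception handlers then kick in and I try every encoding I can think of, without success: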

updating the sn
Met exception.  Changing encoding to cp1252
cp1252 did'nt work.  Changing encoding to utf-8
Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
    file.write(result)
  File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
    file.write(result('cp1252'))
TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scraper.py", line 79, in <module>
    get_latest(entries[0], int(num), entries[1])
  File "scraper.py", line 56, in get_latest
    update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
    file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes

Can anyone give me some pointers on how to better handle HTML data that may have inconsistent encoding?

2 answers:

Answer 0 (score: 1)

Out of curiosity, is this line a typo: file.write(result('cp1252'))? It seems to be missing the .encode method call.

Traceback (most recent call last):
  File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
    file.write(result('cp1252'))
TypeError: 'str' object is not callable

If you change the code to file.write(result.encode('cp1252')), does it run correctly?

I once ran into this kind of encoding problem when writing a file, and found my own solution through the following post:

Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence

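My problem was solved by changing the parser from html.parser to html5lib. Malformed HTML tags were causing the issue, and the html5lib parser handled them. For your reference, here is the documentation for each parser that BeautifulSoup supports.

Hope this helps.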

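Answer 1 (score: 1)

In your code you open the file in text mode, but then you try to write bytes (str.encode returns bytes), so Python throws an exception:

TypeError: write() argument must be str, not bytes

If you want to write bytes, you should open the file in binary mode.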
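A minimal sketch of the difference, using only the standard library. The file name demo.htm and the sample string are placeholders for illustration and do not come from the question:

```python
import os
import tempfile

text = 'caf\u00e9'  # sample string, assumed for illustration
path = os.path.join(tempfile.mkdtemp(), 'demo.htm')  # hypothetical file name

# Text mode accepts str only; passing bytes raises TypeError.
text_mode_error = None
with open(path, 'w', encoding='utf-8') as f:
    try:
        f.write(text.encode('utf-8'))
    except TypeError as exc:
        text_mode_error = exc

# Binary mode accepts bytes, so the encoded content is written as-is.
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Read the raw bytes back to confirm what ended up on disk.
with open(path, 'rb') as f:
    written = f.read()
```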

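BeautifulSoup detects the document's encoding (if it is given bytes) and converts it to a string automatically. We can access the detected encoding through .original_encoding and use it to encode the content when writing to a file. For example,

from bs4 import BeautifulSoup

soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
# Fall back to UTF-8 if no encoding was detected
encoding = soup.original_encoding or 'utf-8'
print(encoding)
# ascii

with open('my.file', 'wb+') as file:
    file.write(data.encode(encoding))

For this to work, you should pass your HTML to BeautifulSoup as bytes, so don't decode the response content.

If BeautifulSoup fails to detect the correct encoding for some reason, you could try a list of possible encodings, as you did in your code.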
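Trying a list of candidate encodings could be sketched roughly as follows. The helper name encode_with_fallback, the order of the candidates, and the sample text are all assumptions made for illustration:

```python
def encode_with_fallback(text, encodings=('ascii', 'cp1252', 'utf-8')):
    """Return (encoded_bytes, encoding) for the first encoding that works."""
    for encoding in encodings:
        try:
            return text.encode(encoding), encoding
        except UnicodeEncodeError:
            # This codec cannot represent some character; try the next one.
            continue
    raise ValueError('none of the candidate encodings could encode the text')


# The snowman (U+2603) is not in ascii or cp1252, so utf-8 is selected.
payload, used = encode_with_fallback('Som\u00e9 t\u00e9xt \u2603')
```

The returned bytes can then be written to a file opened in binary mode. Catching only UnicodeEncodeError, rather than a bare Exception, keeps unrelated errors such as the TypeError above visible.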

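Alternatively, you could open the file in text mode and set the encoding in open (instead of encoding the content yourself), but note that this option is not available in Python 2.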
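A small sketch of the text-mode variant, assuming Python 3. The output path and sample markup are placeholders for illustration:

```python
import os
import tempfile

html = '<p>caf\u00e9 \u2603</p>'  # sample content, assumed for illustration
path = os.path.join(tempfile.mkdtemp(), 'out.htm')  # hypothetical output path

# With an explicit encoding, the file object encodes str on write,
# instead of relying on the platform default (cp1252 in the traceback).
with open(path, 'w', encoding='utf-8') as f:
    f.write(html)

# Read the raw bytes back to check how the text was stored.
with open(path, 'rb') as f:
    raw = f.read()
```

Here the content stays a str all the way to write(), so no manual .encode call is needed.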