Question

I am trying to correctly store the following string, which is the synopsis for https://play.google.com/store/tv/show?id=lXH-sW6govE:

>>> s='''&quot;Work Out New York&quot; invites viewers to break a sweat
         with some of New York City’s hottest personal trainers...'''

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 60: ordinal not in range(128)

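The string needs things like &quot; unescaped, but unescape() should not try to interpret things like the apostrophe, which it is effectively trying to re-encode.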

How do I properly decode and store the string above?

Answer 1

As @roippi pointed out, your HTML contains a smart quote (’), and that is what breaks HTMLParser.HTMLParser().unescape(s). You need to pass HTMLParser.HTMLParser().unescape() a unicode object rather than a str.
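A minimal sketch of that idea, assuming the bytes are UTF-8 encoded (the smart quote arrives as the three bytes \xe2\x80\x99, which matches the 0xe2 in the traceback). The try/except import is my addition so the same snippet also runs on Python 3, where html.unescape takes the place of HTMLParser().unescape; the names raw and text are just illustrative.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

# Raw bytes as they might come off the wire; assumed to be UTF-8.
raw = (b'&quot;Work Out New York&quot; invites viewers to break a sweat '
       b'with some of New York City\xe2\x80\x99s hottest personal trainers...')

# Decode to Unicode first, then unescape the HTML entities.
text = unescape(raw.decode('utf-8'))
```

Decoding first means unescape() never has to mix a byte string with Unicode, so the implicit ASCII decode that raised UnicodeDecodeError does not happen.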

If the HTML is embedded in your script, you can set the encoding to UTF-8 in your editor and create a unicode literal instead:

# coding=utf-8
s = u'''&quot;Work Out New York&quot; invites viewers to break a sweat
         with some of New York City’s hottest personal trainers...'''

The # coding=utf-8 line tells Python how the source file is encoded, so the u'''...''' literal is decoded to Unicode for you.

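When you pull the text from a remote source, you should decode it to Unicode using the appropriate encoding. Do this by checking the "Content-Type" header for the encoding, or use the Requests HTTP library, which gives you Unicode in Response.text.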
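As a rough sketch of the header-based approach: the helper below reads the charset parameter from a Content-Type value and uses it to decode the body. The function name decode_body and the UTF-8 fallback are my assumptions, not something from the question; a real page may also declare its encoding in a meta tag, which this does not look at.

```python
from email.message import Message


def decode_body(body, content_type, default='utf-8'):
    """Decode raw response bytes using the charset named in Content-Type.

    Falls back to `default` (an assumption) when no charset is declared.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset() or default
    return body.decode(charset)


# The same character, delivered in two different encodings.
utf8_text = decode_body(b'caf\xc3\xa9', 'text/html; charset=UTF-8')
latin1_text = decode_body(b'caf\xe9', 'text/html; charset=ISO-8859-1')
```

Both calls produce the same Unicode string, which can then be passed to unescape().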

You may also want to consider BeautifulSoup, which helps you navigate the HTML DOM and unescapes where necessary. Again, BeautifulSoup benefits from decoded Unicode input.

Answer 2

You can use the following:

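import htmllib  # Python 2 only; htmllib was removed in Python 3

def unescape(s):
    # Feed the text through htmllib's parser and return the character
    # data it collected, with entities such as &quot; resolved.
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()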

Using the plain HTMLParser.HTMLParser() will not do. Reference: https://wiki.python.org/moin/EscapingHtml.
