What is this encoding, and how do I convert it?

Time: 2012-12-08 03:45:07

Tags: python string xpath encoding format

I am using lxml and XPath to extract text from an HTML tag attribute via tag.attrib['title']. This is what I get:

MÃ¡laga Airport

In the browser, the same URL shows:

Málaga Airport

How do I convert the former into the latter?

1 answer:

Answer 0: (score: 2)

It seems the lxml HTML parser assumes a 'latin1' encoding for byte strings.

So unless the input really is encoded as 'latin1' (or 'ascii'), the encoding needs to be specified explicitly. In this case it looks like it should be 'utf-8':

>>> from lxml import etree
>>>
>>> html = u"""
... <html>
... <head><title>Test</title></head>
... <body>
... <p test="Málaga">Example</p>
... </body>
... </html>
... """
>>>
>>> html = html.encode('utf-8')
>>>
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(html, parser)
>>> print(tree.xpath('//p/@test')[0])
MÃ¡laga
>>>
>>> parser = etree.HTMLParser(encoding='utf-8')
>>> tree = etree.fromstring(html, parser)
>>> print(tree.xpath('//p/@test')[0])
Málaga
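
If the text has already been extracted and re-parsing is not convenient, the garbled string can usually be repaired after the fact by reversing the wrong decoding. The following is a minimal sketch, assuming Python 3 and assuming the damage came from UTF-8 bytes being read as 'latin1'; the helper name fix_mojibake is hypothetical and not part of lxml.

```python
def fix_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a wrong decoding: turn the text back into the original
    bytes, then decode those bytes with the encoding they were
    actually written in.

    fix_mojibake is a hypothetical helper, not an lxml function.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the text was probably not
        # garbled in this particular way; return it unchanged.
        return text


# "M\u00c3\u00a1laga" is the garbled form (UTF-8 bytes read as latin1).
garbled = "M\u00c3\u00a1laga Airport"
print(fix_mojibake(garbled))
```

Passing the correct encoding to the parser, as in the session above, is still the cleaner fix, because this round trip only works when every garbled character survives the trip back to bytes.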