Python web scraping encoding

Time: 2015-08-27 15:48:28

Tags: python-3.x encoding decode url-encoding

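I can't seem to get the program to recognize u'\xe9' (that is, "é"). It seems to be reading the page as ASCII, which may be the problem, so "coupé" does not print correctly. Any ideas on how to fix this?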

from lxml import html
import requests

new_list = []
page=requests.get('http://www.carfolio.com/specifications/models/?man=557')
tree=html.fromstring(page.text)
model_name = tree.xpath('//span[@class="model name"]/text()'.encode('utf-8'))
for elem in model_name:
    new_list.append(elem)
    if u'\xe9' in elem:
        u'\xe9'.encode('latin-1')
        print(elem)

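I have never dealt with encoding problems before. I could easily drop the elements that contain that troublesome byte, but that is exactly the data I need. If I switch encodings, I get even stranger results.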

* python 3

1 answer:

Answer 0 (score: 0)

from lxml import html
import requests

page = requests.get('http://www.carfolio.com/specifications/models/?man=557')
tree = html.fromstring(page.text)
# The XPath expression can be passed as a plain str; encoding it is not needed.
model_name = tree.xpath('//span[@class="model name"]/text()')
print(len(model_name))
for elem in model_name:
    # Print each name one character at a time, skipping every "é".
    for char in elem:
        if char != "é":
            print(char, end='')
    print(' ')

This at least keeps the same number of elements and simply skips the troublesome é.
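If the goal is to keep the é rather than skip it, it may help to see where the failure can come from. The sketch below is only an illustration under one assumption the question does not confirm: that the text is already decoded correctly and the error comes from an ASCII-only output stream. It uses only the standard library; `can_write` is a hypothetical helper, not part of lxml or requests. It also shows that calling `.encode()` on a string returns new bytes and changes nothing in place, so the `u'\xe9'.encode('latin-1')` line in the question has no effect.

```python
import io

name = "coup\xe9"  # the same text as "coupé"

# str.encode returns a new bytes object; the original string is unchanged.
as_latin1 = name.encode("latin-1")  # b'coup\xe9'
as_utf8 = name.encode("utf-8")      # b'coup\xc3\xa9'


def can_write(text, encoding):
    """Hypothetical helper: report whether text can be written to a stream
    that uses the given encoding, as print() would do with stdout."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
        return True
    except UnicodeEncodeError:
        return False


ascii_ok = can_write(name, "ascii")  # an ASCII-only stream rejects "é"
utf8_ok = can_write(name, "utf-8")   # a UTF-8 stream accepts it
```

If the output stream does turn out to be the cause, then on Python 3.7 or later calling `sys.stdout.reconfigure(encoding="utf-8")` before printing, or setting the `PYTHONIOENCODING` environment variable, would be one way to print "coupé" intact instead of dropping the character.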