lxml.etree.XPathEvalError: Invalid expression

Asked: 2016-05-16 13:50:11

Tags: python xpath lxml

I'm getting an error from Python that I can't understand. I have reduced my code to the bare minimum:

response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/@href')
print(r)

and I still get the error:

Traceback (most recent call last):
  File "ultimate-1.py", line 17, in <module>
    r = tree.xpath('//divass="campaign"]/a/@href')
  File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
  File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
  File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
  File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression

Does anyone know where the problem comes from? Could it be a dependency issue? Thanks.

2 answers:

Answer 0 (score: 1)

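The expression '//divass="campaign"]/a/@href' is not syntactically valid (the closing ] has no matching [) and does not make much sense. You most likely meant to check the class attribute instead: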

//div[@class="campaign"]/a/@href
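The difference between the two expressions can be checked without touching the network. The following is a minimal sketch: the SAMPLE markup and the example.com URLs are made up for illustration and are not the real archive page.

```python
from lxml import etree, html

# Hypothetical markup standing in for the archive page (not the real response).
SAMPLE = '''
<html><body>
  <div class="campaign"><a href="http://example.com/issue-1">Issue 1</a></div>
  <div class="campaign"><a href="http://example.com/issue-2">Issue 2</a></div>
  <div class="other"><a href="http://example.com/about">About</a></div>
</body></html>
'''

tree = html.fromstring(SAMPLE)

# The malformed expression from the question is rejected by the XPath parser.
# XPathError is the base class; the question's traceback shows XPathEvalError.
try:
    tree.xpath('//divass="campaign"]/a/@href')
    error_name = None
except etree.XPathError as exc:
    error_name = type(exc).__name__

# The corrected expression filters div elements on their class attribute.
links = tree.xpath('//div[@class="campaign"]/a/@href')

print(error_name)
print(links)
```

With this sample, the corrected expression returns only the two campaign links and skips the anchor inside the div with class "other".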

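This avoids the Invalid expression error, but the expression will still find nothing, because the data is not in the response that requests receives. You need to mimic what the browser does to get the data: make an additional request for the JavaScript file that contains the campaigns.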

This works for me:

import ast
import re

import requests
from lxml import html

with requests.Session() as session:
    # extract script url
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]

    # get the script
    response = session.get(script_url)
    # response.text is a str, so the str pattern also works on Python 3
    data = ast.literal_eval(re.match(r'document\.write\((.*?)\);$', response.text).group(1))

    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]
    print(campaigns)

Prints:

['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140', 
 ...
 'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]
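The extraction step in the script above can be tried offline. This is only a sketch: SAMPLE_JS is a made-up stand-in for the generate-js response, and the assumption that the real script escapes slashes as \/ is inferred from the backslash stripping in the answer, not confirmed. Here the \/ sequences are unescaped before decoding, because \/ is not a Python string escape.

```python
import ast
import re

# Hypothetical script body; the real generate-js response is not reproduced here.
SAMPLE_JS = r'document.write("<div class=\"campaign\"><a href=\"http:\/\/example.com\/issue-1\">Issue 1<\/a><\/div>");'

# Capture the quoted JavaScript string passed to document.write(...).
match = re.match(r'document\.write\((.*?)\);$', SAMPLE_JS)

# Unescape "\/" first, then decode the rest as a Python string literal.
literal = match.group(1).replace('\\/', '/')
markup = ast.literal_eval(literal)

print(markup)
```

The resulting markup string is plain HTML that can be handed to html.fromstring, as the answer does. ast.literal_eval only works here because the JavaScript string happens to be a valid Python literal too; a script using escapes that Python does not share would need a different decoder.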

Answer 1 (score: 0)

You made a mistake when building the XPath. If you want to get all the hrefs, your code should be:

hrefs = tree.xpath('//div[@class="campaign"]/a')
for href in hrefs:
    print(href.get('href'))

Or in one line:

hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]
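Selecting the elements and calling .get('href') is not quite the same as ending the XPath with /@href. A small sketch of the difference, using made-up markup in which one anchor deliberately has no href:

```python
from lxml import html

# Hypothetical markup; the second anchor deliberately has no href attribute.
SAMPLE = '''
<html><body>
  <div class="campaign"><a href="http://example.com/issue-1">Issue 1</a></div>
  <div class="campaign"><a name="anchor-only">No link</a></div>
</body></html>
'''

tree = html.fromstring(SAMPLE)

# The attribute step returns strings and skips anchors that lack href.
via_attribute_step = tree.xpath('//div[@class="campaign"]/a/@href')

# Selecting elements and calling .get() yields None for a missing href.
via_get = [a.get('href') for a in tree.xpath('//div[@class="campaign"]/a')]

print(via_attribute_step)
print(via_get)
```

When every anchor has an href, both forms give the same list; otherwise the .get() version needs a filter to drop the None entries.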