XPath always returns an empty list

Time: 2019-02-08 15:22:55

Tags: python xpath web-crawler

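I can't figure out why 'title_List' always comes back empty.
I tried changing the "User-Agent", but the result was the same.

Can anyone tell me what is wrong with my code?

Chrome's XPath Helper extension reports that the XPath is correct (the original post included a screenshot).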

Here is my code:

#coding=utf-8
import re
import urllib2
import urllib
from lxml import etree


def init():

    url = 'https://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3&ie=utf-8&pn=0'
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request).read()
    print(1)
    print(response)
    # parse the response and extract the data
    get_title(response)
    print(4)



# extract the href of each thread title
def get_title(response):
    # parse the HTML so it can be queried with XPath
    html_dom = etree.HTML(response)

    ts = html_dom.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@class="j_th_tit"]/@href')
    print(2)
    print(ts)
    for href in ts:
        full_link='https://tieba.baidu.com'+str(href)
        print(3)
        print(full_link)

Result (I removed part of the output because of the length limit):

    1
    <!DOCTYPE html>
    <!--STATUS OK-->
    <html>
    ...
<div class="threadlist_lz clearfix">
                <div class="threadlist_title pull_left j_th_tit 
">
    <i class="icon-member-top" alt="会员置顶" title="会员置顶" ></i><i class="icon-good" alt="精品" title="精品" ></i>

    <a rel="noreferrer" href="/p/5006374769" title="【答疑解惑】误删误封绿色通道" target="_blank" class="j_th_tit ">【答疑解惑】误删误封绿色通道</a>
</div><div class="threadlist_author pull_right">

...




2
[]
4

1 Answer:

Answer 0 (score: 1)

Your XPath expression has the wrong value for the @class attribute. XPath compares attribute values as exact strings, and in the page source the class of the link is "j_th_tit " with a trailing space. Change the value to "j_th_tit " (with the trailing space) and it will match:

//div[@class="threadlist_lz clearfix"]/div/a[@class="j_th_tit "]/@href

To avoid this kind of error, it is usually better to match the attribute with the contains(...) function.

This approach is less precise, but it is sufficient in most cases.
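As a rough illustration of the difference, here is a minimal sketch in Python 3 with lxml. The HTML fragment is trimmed from the output in the question; the helper name thread_links, the shortened link text, and the concat/normalize-space variant at the end are my own assumptions and not part of the original code.

```python
from lxml import etree

# Fragment trimmed from the output in the question.
# Note the trailing space in class="j_th_tit " on the <a> element.
SAMPLE_HTML = '''
<div class="threadlist_lz clearfix">
  <div class="threadlist_title pull_left j_th_tit ">
    <a rel="noreferrer" href="/p/5006374769" target="_blank" class="j_th_tit ">thread title</a>
  </div>
</div>
'''

BASE = '//div[@class="threadlist_lz clearfix"]/div/a'


def thread_links(html_text, predicate):
    """Hypothetical helper: return full thread URLs for <a> elements matching the predicate."""
    dom = etree.HTML(html_text)
    hrefs = dom.xpath(BASE + '[' + predicate + ']/@href')
    return ['https://tieba.baidu.com' + str(href) for href in hrefs]


# Exact string comparison without the trailing space: no match.
exact = thread_links(SAMPLE_HTML, '@class="j_th_tit"')

# Exact string comparison including the trailing space: matches.
trailing = thread_links(SAMPLE_HTML, '@class="j_th_tit "')

# Substring test: matches regardless of surrounding whitespace,
# but would also match a class such as "j_th_tit_extra".
loose = thread_links(SAMPLE_HTML, 'contains(@class, "j_th_tit")')

# Stricter whole-token test built from standard XPath 1.0 functions.
token = thread_links(
    SAMPLE_HTML,
    'contains(concat(" ", normalize-space(@class), " "), " j_th_tit ")',
)

print(exact)
print(trailing)
print(loose)
print(token)
```

The last variant is only worth the extra length when the page has other classes that contain "j_th_tit" as a substring; for this page the plain contains(...) test should be enough.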