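XPath: selecting nodes with a condition

Time: 2015-07-10 09:46:49

Tags: xpath scrapy

I am using Scrapy, a Python-based framework, to scrape a website, but I can't figure out how to select the text of the element with class value ellipsis ph. Sometimes that element contains strong tags. So far I have only managed to extract the text when there are no strong child tags.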

<div class="right">
    <div class="attrs">
        <div class="attr">
            <span class="name">Main Products:</span>
                <div class="value ellipsis ph">
                    <!-- The text below is what I need to select, ignoring the strong tags -->
                    <strong>Shoes</strong> 
                    (Sport
                    <strong>Shoes</strong>
                    ,Casual
                    <strong>Shoes</strong>
                    ,Hiking
                    <strong>Shoes</strong>
                    ,Skate
                    <strong>Shoes</strong>
                    ,Football
                    <strong>Shoes</strong>
                    )
                </div>
        </div>
    </div>
</div>


<div class="right">
    <div class="attrs">
        <div class="attr">
            <span class="name">Main Products:</span>
                <div class="value ellipsis ph">
                    Cap, Shoe, Bag <!-- this one I could select -->

                </div>
        </div>
    </div>
</div>

Starting from the root of the selected node, the following works, but it only selects text that is not inside a strong tag.

"/div[@class='right']/div[@class='attrs']/div[@class='attr']/div/text()").extract()

2 answers:

Answer 0 (score: 2)

As @splash58 wrote in the comments, the XPath

//div[@class="value ellipsis ph"]//text()

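retrieves the text content of both blocks. For the first block the result is a list of text nodes that contains both the text inside the <strong> tags and the text outside them. This is because //text() selects every text node in the subtree, even when there are further child tags, whereas /text() selects only the text nodes that are direct children of the div.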
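The difference between child text nodes and descendant text nodes can be sketched without Scrapy. The example below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and a shortened, single-line version of the markup from the question (an assumption made to keep the output readable). The text and tail attributes stand in for div/text(), and itertext() stands in for div//text().

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the first block from the question.
markup = (
    '<div class="value ellipsis ph">'
    "<strong>Shoes</strong> (Sport <strong>Shoes</strong>, Casual <strong>Shoes</strong>)"
    "</div>"
)
div = ET.fromstring(markup)

# Text nodes that are direct children of the div (roughly div/text()):
# the div's own leading text plus the tail that follows each child element.
candidates = [div.text] + [child.tail for child in div]
child_text = [t for t in candidates if t]

# Every text node in the subtree (roughly div//text()).
all_text = list(div.itertext())

print(child_text)
print("".join(all_text))
```

The first list misses every "Shoes" because those text nodes are children of the strong elements, not of the div. The joined second list reads as the full sentence.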

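Answer 1 (score: 2)

Assuming you want the text representation of the div elements with class value ellipsis ph, you can either:

  • use .//text() to select all descendant text nodes, not only the direct children
  • or use XPath's string() function on the div element

Here are both options: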

>>> import scrapy
>>> selector = scrapy.Selector(text="""<div class="right">
...     <div class="attrs">
...         <div class="attr">
...             <span class="name">Main Products:</span>
...                 <div class="value ellipsis ph">
...  <!-- // Here below i needed to select it ignoring the strong tag -->
...                     <strong>Shoes</strong> 
...                     (Sport
...                     <strong>Shoes</strong>
...                     ,Casual
...                     <strong>Shoes</strong>
...                     ,Hiking
...                     <strong>Shoes</strong>
...                     ,Skate
...                     <strong>Shoes</strong>
...                     ,Football
...                     <strong>Shoes</strong>
...                     )
...                 </div>
...         </div>
...     </div>
... </div>
... 
... 
... <div class="right">
...     <div class="attrs">
...         <div class="attr">
...             <span class="name">Main Products:</span>
...                 <div class="value ellipsis ph">
...                     Cap, Shoe, Bag <!-- // could select this -->
... 
...                 </div>
...         </div>
...     </div>
... </div>""")
>>> for div in selector.css('div.value.ellipsis.ph'):
...     print("---")
...     print("".join(div.xpath('.//text()').extract()))
... 
---


                    Shoes 
                    (Sport
                    Shoes
                    ,Casual
                    Shoes
                    ,Hiking
                    Shoes
                    ,Skate
                    Shoes
                    ,Football
                    Shoes
                    )

---

                    Cap, Shoe, Bag 


>>> for div in selector.css('div.value.ellipsis.ph'):
...     print("---")
...     print(div.xpath('string()').extract_first())
... 
---


                    Shoes 
                    (Sport
                    Shoes
                    ,Casual
                    Shoes
                    ,Hiking
                    Shoes
                    ,Skate
                    Shoes
                    ,Football
                    Shoes
                    )

---

                    Cap, Shoe, Bag 


>>> 
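Both options return the text with the line breaks and indentation of the source markup. If a single-line value is wanted, the whitespace can be collapsed afterwards. The helper below is a minimal sketch in plain Python; the name collapse_whitespace and the sample string are made up for illustration. XPath 1.0 also defines normalize-space(), which should behave similarly if used in place of string(), though that variant is not tested here.

```python
def collapse_whitespace(text):
    """Strip both ends and squeeze each run of whitespace into one space."""
    # str.split() with no argument splits on any whitespace run and
    # drops empty pieces, so joining with " " normalizes the string.
    return " ".join(text.split())


# Hypothetical sample shaped like the output of string() above.
raw = "\n\n    Shoes \n    (Sport\n    Shoes\n    ,Casual\n    Shoes\n    )\n"
cleaned = collapse_whitespace(raw)
print(cleaned)
```

The spaces before "," and ")" remain, because they replace line breaks that are part of the text nodes themselves.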