Question

I have the following simple nested structure:

<main>
    <em>bla-bla</em>

    <div class="1">1.1</div>

    <div class="2">2.1</div>

    <div class="2">2.2</div>

    <div class="1">1.2</div>

    <div class="2">
        <span>
            <em>2.3</em>
        </span>
    </div>

    <div class="2">2.4</div>

</main>

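I now want to extract all the text from every div node with class "2", but I am struggling with the nested nodes (span, em, etc.).

The expected output should be: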

2.1
2.2
2.3
2.4

Trying something like:

//div[contains(@class,"2")]/text()

gives:

2.1
2.2
<div class="2"><span><em>2.3</em></span></div>
<div class="2"><span><em>2.3</em></span></div>
2.4

Instead of a single XPath expression, I also tried doing it in several steps in Scrapy, for example:

divs = response.xpath('//div[contains(@class, "2")]')

for div in divs:
    # now check somehow that the div contains an "em" node
    pass

Using

div.xpath("//em")

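does not work, because it returns all em nodes in the document. Of course, I could call div.extract() here and find the node with a string search on the returned string, but that is a hack and does not look like a proper Scrapy solution.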
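
A note on why this happens: in XPath, a path that starts with // is absolute, so it is evaluated from the document root even when it is called on a sub-selector. Prefixing it with a dot, as in div.xpath(".//em"), makes it relative to the current div. The sketch below illustrates the relative lookup; it uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without Scrapy (an assumption made for illustration, not the asker's setup), and it matches class="2" exactly because ElementTree has no contains().

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (well-formed, so an XML parser accepts it).
DOC = """<main>
    <em>bla-bla</em>
    <div class="1">1.1</div>
    <div class="2">2.1</div>
    <div class="2">2.2</div>
    <div class="1">1.2</div>
    <div class="2">
        <span>
            <em>2.3</em>
        </span>
    </div>
    <div class="2">2.4</div>
</main>"""

root = ET.fromstring(DOC)

# Exact match on class="2" (stand-in for contains(@class, "2")).
divs = root.findall(".//div[@class='2']")

# ".//em" is evaluated relative to each div, so the top-level
# <em>bla-bla</em> outside the divs is never matched.
divs_with_em = [div for div in divs if div.find(".//em") is not None]

print(len(divs), len(divs_with_em))
print(divs_with_em[0].find(".//em").text)
```

In Scrapy itself the corresponding check inside the loop would be div.xpath(".//em"), which returns an empty list for divs without an em descendant.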

Any suggestions for solving this, either directly with XPath or with Scrapy in general, would be appreciated.

Answer 1

What do you think of [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]?

Without stripping, it also returns some whitespace-only strings:

>>> response.xpath('//div[contains(@class, "2")]//text()').extract()
[u'2.1', u'2.2', u'\n        ', u'\n            ', u'2.3', u'\n        ', u'\n    ', u'2.4']

So I filtered them out with strip:

>>> [i.strip() for i in response.xpath('//div[contains(@class, "2")]//text()').extract() if i.strip()]
[u'2.1', u'2.2', u'2.3', u'2.4']
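
The same idea (collect every descendant text node, then drop the whitespace-only ones) can be sketched without Scrapy. The example below is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of response.xpath, matches class="2" exactly because ElementTree has no contains(), and texts_for_class is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question.
DOC = """<main>
    <em>bla-bla</em>
    <div class="1">1.1</div>
    <div class="2">2.1</div>
    <div class="2">2.2</div>
    <div class="1">1.2</div>
    <div class="2">
        <span>
            <em>2.3</em>
        </span>
    </div>
    <div class="2">2.4</div>
</main>"""


def texts_for_class(root, cls):
    """Return stripped, non-empty text from each matching div and its descendants."""
    result = []
    for div in root.findall(f".//div[@class='{cls}']"):
        # itertext() walks the div and all of its descendants,
        # which plays the role of //text() in the XPath above.
        for chunk in div.itertext():
            chunk = chunk.strip()
            if chunk:  # skip the whitespace-only pieces between tags
                result.append(chunk)
    return result


root = ET.fromstring(DOC)
print(texts_for_class(root, "2"))  # ['2.1', '2.2', '2.3', '2.4']
```

If you would rather filter inside XPath, appending the predicate [normalize-space()] to //text() should select only the text nodes that contain non-whitespace characters, although the matched strings would still need strip() to remove surrounding whitespace.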

Scrapy / XPATH: how to extract text only from descendants and self
