Why does xpath inside the selector loop still return a list in the tutorial?

Asked: 2016-02-26 10:37:46

Tags: xpath, scrapy

I am learning Scrapy with the tutorial: http://doc.scrapy.org/en/1.0/intro/tutorial.html

When I run the following example script from the tutorial, I find that even though it is already iterating over a list of selectors, the title I get from sel.xpath('a/text()').extract() is still a list containing a single string, i.e. [u'Python 3 Object Oriented Programming'] rather than u'Python 3 Object Oriented Programming'. In a later example that list is assigned to an item field, item['title'] = sel.xpath('a/text()').extract(), which seems logically wrong to me.

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

However, if I use the following code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            link = href.extract()
            print(link)

link is a string, not a list.

Is this a bug or intended behaviour?

1 Answer:

Answer 0 (score: 8):

.xpath().extract() and .css().extract() return a list, because .xpath() and .css() return SelectorList objects.

See https://parsel.readthedocs.org/en/v1.0.1/usage.html#parsel.selector.SelectorList.extract

(SelectorList).extract():

Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
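
As a minimal sketch of that behaviour, using parsel directly (the selector library documented above); the HTML snippet here is invented purely for illustration:

from parsel import Selector

# Invented one-entry snippet mimicking a <li> from the dmoz page
sel = Selector(text=u'<ul><li><a href="/x">Python 3 Object Oriented Programming</a></li></ul>')

titles = sel.xpath('//li/a/text()')
print(type(titles))       # a SelectorList
print(titles.extract())   # [u'Python 3 Object Oriented Programming'] -- always a list, even with one match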

.extract_first() is what you are looking for (it is only sparsely documented).

Taken from http://doc.scrapy.org/en/latest/topics/selectors.html

If you want to extract only the first matched element, you can call the selector's .extract_first():

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
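
Applied to the spider from the question, a sketch like this (same XPath expressions, only switching to .extract_first()) prints plain unicode strings instead of one-element lists; a field with no match comes back as None:

def parse(self, response):
    for sel in response.xpath('//ul/li'):
        # .extract_first() returns the first matched result as a unicode
        # string, or None when nothing matches
        title = sel.xpath('a/text()').extract_first()
        link = sel.xpath('a/@href').extract_first()
        desc = sel.xpath('text()').extract_first()
        print title, link, desc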

In your other example:

def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        link = href.extract()
        print(link)

Each href in the loop is a Selector object. Calling .extract() on it gives you a Unicode string:

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/"
2016-02-26 12:11:36 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
In [1]: response.css("ul.directory.dir-col > li > a::attr('href')")
Out[1]: 
[<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>,
 ...
 <Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>]

.css() on the response returns a SelectorList:

In [2]: type(response.css("ul.directory.dir-col > li > a::attr('href')"))
Out[2]: scrapy.selector.unified.SelectorList

Looping over that object gives you Selector instances:

In [5]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
   ...:     print href
   ...:     
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>
(...)
<Selector xpath=u"descendant-or-self::ul[@class and contains(concat(' ', normalize-space(@class), ' '), ' directory ') and (@class and contains(concat(' ', normalize-space(@class), ' '), ' dir-col '))]/li/a/@href" data=u'/Computers/Programming/Languages/Python/'>

Calling .extract() on each of them gives you a Unicode string:

In [6]: for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
    print type(href.extract())
   ...:     
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>
<type 'unicode'>

Note: .extract() on a Selector is wrongly documented as returning a list of strings. I will open an issue on parsel (which provides the same selectors as Scrapy, and is what Scrapy 1.1+ uses).
