Question

I want to crawl http://news.qq.com/ and collect all the relative URLs under a given tag.

My code is:

import scrapy
from scrapy.selector import Selector  
from homework.items import HomeworkItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

class News1Spider(scrapy.Spider):
    name = "News1"
    allowed_domains = ["http://news.qq.com/"]
    start_urls = (
        'http://news.qq.com/',
    )
    rules = (
        Rule(LxmlLinkExtractor(restrict_xpaths='//div[@class="Q-   tpList"]/div/a/@href'),callback='parse'),
    )

    def parse(self, response):
        sel = Selector(response)
    #lis = sel.xpath('//div[@class="Q-tpList"]')
    #item = TutorialItem()
    #for li in lis:
        title = sel.xpath('//div[@id=C-Main-Article-QQ]/div[1]/text()').extract()
        content =sel.xpath('//div[@id=Cnt-Main-Article-QQ]/p/text()').extract()
        print title

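I run it from cmd with: scrapy crawl News1

The title is never printed in the command window. Can you tell me how to fix this, and why it happens? Thanks.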

Answer 1

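You are subclassing Spider, but since you define rules, I think you meant to use CrawlSpider. In that case you also need to change the structure a little, because parse is used internally by CrawlSpider to find new links to crawl, so your callback needs a different name: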

rules = (
    Rule(LxmlLinkExtractor(restrict_xpaths='//div[@class="Q-   tpList"]/div/a/@href'), callback='parse_page'),
)

def parse_page(self, response):
    ...

You should also fix this class name by removing the spaces:

//div[@class="Q-   tpList"]/div/a/@href
                ^^^
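
To see why the stray spaces matter: a predicate such as @class="..." is an exact string comparison, so the class name has to match character for character. Here is a minimal sketch in Python 3 that uses the standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors. The markup is invented for illustration and is not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the structure the XPath expects.
SNIPPET = """
<html><body>
  <div class="Q-tpList"><div><a href="/a/1.htm">one</a></div></div>
  <div class="Q-tpList"><div><a href="/a/2.htm">two</a></div></div>
</body></html>
"""

root = ET.fromstring(SNIPPET.strip())

# Class name with the stray spaces, as pasted in the question.
broken = root.findall(".//div[@class='Q-   tpList']/div/a")

# Class name with the spaces removed.
fixed = root.findall(".//div[@class='Q-tpList']/div/a")

print(len(broken), len(fixed))  # prints: 0 2
```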

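Finally, I think you are using an old version of Scrapy. I suggest upgrading now, before you write more code against the old API, because switching will only get harder later.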

Answer 2

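First of all, you are either using an outdated Scrapy version or your imports are wrong, because Scrapy now has only one type of link extractor: LinkExtractor (which is the renamed LxmlLinkExtractor).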

I tested this and it works perfectly:

$ scrapy shell 'http://news.qq.com/'
from scrapy.linkextractors import LinkExtractor
LinkExtractor(restrict_xpaths=['//div[@class="Q-tpList"]/div/a']).extract_links(response)
# got 43 results

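Note that the XPath has no spaces in the @class check, and that it points to the a node rather than the @href attribute, because LinkExtractor extracts links from nodes, not from attributes.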
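
The node-versus-attribute point, and what happens to relative hrefs, can be illustrated without Scrapy. The sketch below uses only the Python 3 standard library (xml.etree.ElementTree and urllib.parse.urljoin) to mimic the idea: select the a elements, read href from each one, and resolve it against the page URL. The markup and paths are invented. LinkExtractor does the equivalent work for you and returns Link objects whose url is already absolute.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://news.qq.com/"

# Hypothetical markup with relative hrefs; not the real page source.
SNIPPET = """
<html><body>
  <div class="Q-tpList"><div><a href="/a/1.htm">one</a></div></div>
  <div class="Q-tpList"><div><a href="a/2.htm">two</a></div></div>
</body></html>
"""

root = ET.fromstring(SNIPPET.strip())

# Select the <a> nodes themselves, then read the attribute from each node.
anchors = root.findall(".//div[@class='Q-tpList']/div/a")

# Resolve each relative href against the URL of the page it came from.
links = [urljoin(PAGE_URL, a.get("href")) for a in anchors]

for link in links:
    print(link)
```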

Scrapy LxmlLinkExtractor: crawling relative URLs
