How to follow the next page in a Scrapy crawler to scrape content

Date: 2016-02-10 07:22:47

Tags: python-2.7 scrapy web-crawler

I am able to scrape all the stories from the first page; my problem is how to move to the next page and continue scraping the stories and names. Please check my code below:

# -*- coding: utf-8 -*-
import scrapy
from cancerstories.items import CancerstoriesItem
class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()
class MySpider(scrapy.Spider):

    name = 'cancerstories'
    allowed_domains = ['thebreastcancersite.greatergood.com']
    start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']

    def parse(self, response):

        rows = response.xpath('//a[contains(@href,"story")]')

        #loop over all links to stories
        for row in rows:
            myItem = MyItem() # Create a new item
            myItem['name'] = row.xpath('./text()').extract() # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0]) # extract url from link
            request = scrapy.Request(url = story_url, callback = self.parse_detail) # create request for detail page with story
            request.meta['myItem'] = myItem # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem'] # extract the item (with the name) from the response
        #myItem['name']=response.xpath('//h1[@class="headline"]/text()').extract()
        text_raw = response.xpath('//div[@class="photoStoryBox"]/div/p/text()').extract() # extract the story (text)
        myItem['story'] = ' '.join(map(unicode.strip, text_raw)) # clean up the text and assign to item
        yield myItem # return the item

2 Answers:

Answer 0 (score: 2)

You can change your spider to a CrawlSpider and use a Rule with a LinkExtractor to follow the link to the next page.

For this approach you have to include the following code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
...
class MySpider(CrawlSpider):
    ...
    rules = (
        Rule(LinkExtractor(allow='\.\./stories;jsessionid=[0-9A-Z]+\?page=[0-9]+')),
    )
    ...

This way, for each page you visit, the spider will create a request for the next page (if one exists), follow it once the parse method has finished executing, and repeat the process.

Edit

The rule I wrote only follows the link to the next page; it does not extract the stories. If your first approach works, there is no need to change it (a sketch combining the two is shown below).
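For illustration, a minimal sketch of a rules tuple that both follows pagination and hands each story page to a callback. The second allow pattern and the parse_story callback name are assumptions for the sketch, not part of the original spider:

rules = (
    # follow pagination links; no callback, so the spider just keeps crawling
    Rule(LinkExtractor(allow=r'\?page=[0-9]+')),
    # hand each story page to a callback for extraction
    Rule(LinkExtractor(allow=r'/story/'), callback='parse_story'),
)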

Also, regarding the rule in the comments: SgmlLinkExtractor is deprecated, so I recommend you use the default link extractor, and the rule itself wasn't well defined.

If you don't define the attrs parameter of the extractor, it searches for links in the href attributes of tags in the body, and in this case those look like /clickToGive/bcs/story/mother-of-4435 rather than ../story/mother-of-4435. That's why it wasn't finding any links.
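One way to check what the default extractor actually sees is a quick scrapy shell session on the listing page (the printed URL below is the story path taken from this answer, used as an illustration):

from scrapy.linkextractors import LinkExtractor
for link in LinkExtractor().extract_links(response):
    print link.url  # e.g. http://thebreastcancersite.greatergood.com/clickToGive/bcs/story/mother-of-4435

Note that extract_links returns absolute URLs (relative hrefs are joined against the response URL), so the allow pattern should be written to match the resolved form.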

Answer 1 (score: 0)

If you want to use the scrapy.Spider class, you can follow the next page manually, for example:

next_page = response.css('a.pageLink::attr(href)').extract_first()
if next_page:
    absolute_next_page_url = response.urljoin(next_page)
    yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

If you want to use the CrawlSpider class, don't forget to rename the parse method to parse_start_url.
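Putting that snippet together with the question's spider, a minimal sketch of the amended parse method might look like this. The a.pageLink selector is this answer's assumption about the site's pagination markup and may need adjusting; MyItem and parse_detail are the question's own definitions:

def parse(self, response):
    rows = response.xpath('//a[contains(@href,"story")]')
    for row in rows:
        myItem = MyItem()
        myItem['name'] = row.xpath('./text()').extract()
        story_url = response.urljoin(row.xpath('./@href').extract()[0])
        request = scrapy.Request(url=story_url, callback=self.parse_detail)
        request.meta['myItem'] = myItem
        yield request

    # after yielding the story requests, queue the next listing page (if any)
    next_page = response.css('a.pageLink::attr(href)').extract_first()
    if next_page:
        absolute_next_page_url = response.urljoin(next_page)
        yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)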