How can I use Scrapy to fetch content from a second page in the scenario below?

Asked: 2017-05-14 13:53:11

Tags: scrapy scrapy-spider

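I have a spider that needs to produce an array of objects, each with 5 fields. Four of the fields are on the same page; the fifth is a URL whose page I need to extract data from, so that all 5 fields are returned as text. In the snippet below, the explanation is the field that lives on another page. I need to parse that page and add its data to the other attributes when yielding the item.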

My current solution, when exported to a JSON file, produces the output below. As you can see, my "e" is not resolved. How do I get the data?

[
    {
        "q": "How many pairs of integers (x, y) exist such that the product of x, y and HCF (x, y) = 1080?",
        "c": [
            "8",
            "7",
            "9",
            "12"
        ],
        "a": "Choice (C).9",
        "e": "<Request GET http://iim-cat-questions-answers.2iim.com/quant/number-system/hcf-lcm/hcf-lcm_1.shtml>",
        "d": "Hard"
    }
]


class CatSpider(scrapy.Spider):
    name = "catspider"
    start_urls = [
        'http://iim-cat-questions-answers.2iim.com/quant/number-system/hcf-lcm/'
    ]

    def parse_solution(self, response):
        yield response.xpath('//p[@class="soln"]').extract_first()

    def parse(self, response):
        for lis in response.xpath('//ol[@class="z1"]/li'):
            questions = lis.xpath('.//p[@lang="title"]/text()').extract_first()
            choices = lis.xpath(
                './/ol[contains(@class, "xyz")]/li/text()').extract()
            answer = lis.xpath(
                './/ul[@class="exp"]/li/span/span/text()').extract_first()
            explanation = lis.xpath(
                './/ul[@class="exp"]/li[2]/a/@href').extract_first()
            difficulty = lis.xpath(
                './/ul[@class="exp"]/li[last()]/text()').extract_first()
            if questions and choices and answer and explanation and difficulty:
                yield {
                    'q': questions,
                    'c': choices,
                    'a': answer,
                    'e': scrapy.Request(response.urljoin(explanation), callback=self.parse_solution),
                    'd': difficulty
                }

1 answer:

Answer 0 (score: 2)

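Scrapy is an asynchronous framework, which means none of its components block. A Request object does nothing by itself; it only stores information for the Scrapy downloader. That is why you cannot simply create one inline and expect it to download something, which is what you are doing now.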

The usual solution is to design a crawl chain in which the data is carried from one callback to the next:

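from scrapy import Request

# Both functions are meant to be methods of a scrapy.Spider subclass.
def parse(self, response):
    item = dict()
    item['foo'] = 'foo is great'
    next_page = 'http://nextpage.com'
    # The download is scheduled only once the Request is returned to the engine.
    return Request(next_page,
                   callback=self.parse2,
                   meta={'item': item})  # put our item in meta

def parse2(self, response):
    item = response.meta['item']  # take our item from the meta
    item['bar'] = 'bar is great too!'
    return item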