Question

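I have a site to scrape. Its main page shows story teasers, so that page is where parsing starts. My spider starts from it and collects data about each story: author, rating, publication date, and so on. The spider does this part correctly.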

import scrapy
from scrapy.spiders import Spider
from sxtl.items import SxtlItem
from scrapy.http.request import Request


class SxtlSpider(Spider):
    name = "sxtl"

    start_urls = ['some_site']


    def parse(self, response):

        list_of_stories = response.xpath('//div[@id and @class="storyBox"]')

        item = SxtlItem()

        for i in list_of_stories:

            # Adjacent string literals are joined without inserting whitespace,
            # so the class names inside the XPath stay intact.
            pre_rating = i.xpath(
                'div[@class="storyDetail"]/div[@class="storyDetailWrapper"]'
                '/div[@class="block rating_positive"]/span/text()').extract()
            rating = float(("".join(pre_rating)).replace("+", ""))

            link = "".join(i.xpath(
                'div[@class="wrapSLT"]/div[@class="titleStory"]/a/@href').extract())

            if rating > 6:
                yield Request("".join(link), meta={'item':item}, callback=\
                                                            self.parse_story)
            else:
                break

    def parse_story(self, response):

        item = response.meta['item']

        number_of_pages = response.xpath('//div[@class="pNavig"]/a[@href]\
                                        [last()-1]/text()').extract()

        if number_of_pages:
            item['number_of_pages'] = int("".join(number_of_pages))
        else:
            item['number_of_pages'] = 1

        item['date'] = "".join(response.xpath('//span[@class="date"]\
                                                /text()').extract()).strip()
        item['author'] = "".join(response.xpath('//a[@class="author"]\
                                                /text()').extract()).strip()
        item['text'] = response.xpath('//div[@id="storyText"]/div\
                [@itemprop="description"]/text() | //div[@id="storyText"]\
                        /div[@itemprop="description"]/p/text()').extract()
        item['list_of_links'] = response.xpath('//div[@class="pNavig"]\
                                            /a[@href]/@href').extract()

        yield item

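So the data is collected correctly, but only from the first page of each story. Each story has several pages (with links to page 2, 3, 4, sometimes up to 15), and this is where the problem appears. To fetch the second page of each story, I replaced yield item with: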

yield Request("".join(item['list_of_links'][0]), meta={'item':item}, \
                                                callback=self.get_text)


def get_text(self, response):

    item = response.meta['item']

    item['text'].extend(response.xpath('//div[@id="storyText"]/div\
        [@itemprop="description"]/text() | //div[@id="storyText"]\
                /div[@itemprop="description"]/p/text()').extract())

    yield item

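The spider does fetch the next (second) pages, but it appends them to the first page of an arbitrary story. For example, the second page of the 1st story may be added to the 4th story, the second page of the 5th story is added to the 1st story, and so on.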

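Please help: how do I collect data into a single item (one dictionary) when the data to be scraped is spread across several web pages? (In this case, how do I keep data from different items from getting mixed together?)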

Thanks.

Answer 1

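Technically, the steps are:

1) Scrape page 1 of the story.
2) Check whether it has more pages.
3) If not, just yield the item.
4) If it has a next-page button/link, scrape that link and pass the whole data dictionary on to the next callback method.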

def parse_story(self, response):

    item = response.meta['item']

    number_of_pages = response.xpath('//div[@class="pNavig"]/a[@href]\
                                    [last()-1]/text()').extract()

    if number_of_pages:
        item['number_of_pages'] = int("".join(number_of_pages))
    else:
        item['number_of_pages'] = 1

    item['date'] = "".join(response.xpath('//span[@class="date"]\
                                            /text()').extract()).strip()
    item['author'] = "".join(response.xpath('//a[@class="author"]\
                                            /text()').extract()).strip()
    item['text'] = response.xpath('//div[@id="storyText"]/div\
            [@itemprop="description"]/text() | //div[@id="storyText"]\
                    /div[@itemprop="description"]/p/text()').extract()
    item['list_of_links'] = response.xpath('//div[@class="pNavig"]\
                                        /a[@href]/@href').extract()

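    # if it has NEXT PAGE button
    # The next-page selector is not given in the original answer; the XPath
    # below is an assumed placeholder and must be adapted to the real site.
    next_page_url = response.xpath('//a[@rel="next"]/@href').extract_first()
    if next_page_url:
        yield Request(url=response.urljoin(next_page_url),
                      callback=self.get_text, meta={'item': item})
    else:
        # it has no more pages, so just yield data.
        yield item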





def get_text(self, response):

    item = response.meta['item']


    # merge text here
    item['text'] = item['text'] + response.xpath('//div[@id="storyText"]/div\
        [@itemprop="description"]/text() | //div[@id="storyText"]\
                /div[@itemprop="description"]/p/text()').extract()


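    # Now again check here if it has NEXT PAGE button call same function again.
    # The next-page selector is not given in the original answer; the XPath
    # below is an assumed placeholder and must be adapted to the real site.
    next_page_url = response.xpath('//a[@rel="next"]/@href').extract_first()
    if next_page_url:
        yield Request(url=response.urljoin(next_page_url),
                      callback=self.get_text, meta={'item': item})
    else:
        # no more pages, now finally yield the ITEM
        yield item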
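
To see the chaining idea in isolation, here is a minimal sketch in plain Python with no Scrapy involved. Everything in it is an assumption made for illustration: the PAGES dictionary is a made-up site, and crawl is a tiny stand-in for the Scrapy engine that simply processes pending requests in FIFO order. The point it tries to show is that when each story gets its own item and that item travels along the chain of pages in meta, pages from different stories can be processed interleaved without their text getting mixed.

```python
from collections import deque

# Hypothetical site: url -> (text fragments on that page, next-page url or None)
PAGES = {
    'a/1': (['A1'], 'a/2'),
    'a/2': (['A2'], 'a/3'),
    'a/3': (['A3'], None),
    'b/1': (['B1'], 'b/2'),
    'b/2': (['B2'], None),
    'c/1': (['C1'], None),
}


def parse_story(url, meta):
    """First page of a story: create a fresh item, then follow the chain."""
    text, next_url = PAGES[url]
    item = {'first_page': url, 'text': list(text)}
    if next_url:
        # Hand the partially filled item to the next callback.
        yield ('request', next_url, get_text, {'item': item})
    else:
        yield ('item', item)


def get_text(url, meta):
    """Later pages: extend the item received in meta, then continue or finish."""
    item = meta['item']
    text, next_url = PAGES[url]
    item['text'] = item['text'] + text
    if next_url:
        yield ('request', next_url, get_text, {'item': item})
    else:
        yield ('item', item)


def crawl(start_urls):
    """Stand-in for the engine: run callbacks from a FIFO queue of requests."""
    queue = deque((url, parse_story, {}) for url in start_urls)
    items = []
    while queue:
        url, callback, meta = queue.popleft()
        for kind, *payload in callback(url, meta):
            if kind == 'request':
                queue.append(tuple(payload))  # (url, callback, meta)
            else:
                items.append(payload[0])
    return items


stories = crawl(['a/1', 'b/1', 'c/1'])
for story in stories:
    print(story['first_page'], story['text'])
```

Note that an item is only emitted at the end of its chain, so stories with fewer pages finish first; the order of the results is not the order of the start URLs.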

Answer 2

After many attempts and reading a lot of documentation, I found the solution:

item = SxtlItem()

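This Item declaration should be moved from the parse function to the beginning of the parse_story function. The line "item = response.meta['item']" in parse_story should be removed. And, of course, the line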

yield Request("".join(link), meta={'item':item}, callback=self.parse_story)

should be changed to

yield Request("".join(link), callback=self.parse_story)

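Why? Because the Item was declared only once, and all of its fields were being rewritten over and over. While each story had only one page, everything looked fine, as if we had a "new" item each time. But when a story has several pages, the item gets overwritten in a chaotic way and we get jumbled results. In short: a new Item must be created for every story, because we want to save many item objects.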

After moving "item = SxtlItem()" to the right place, everything works perfectly.
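
The effect described above can be reproduced without Scrapy. The sketch below is a simplified illustration, not the spider itself: a plain dict stands in for SxtlItem, and the helper names (build_requests_shared, build_requests_fresh, fill) are hypothetical. It contrasts creating the item once before the loop with creating a new one per story.

```python
def build_requests_shared(links):
    """Buggy pattern: one item is created before the loop and reused."""
    item = {}  # created once, so every request carries the same object
    return [(link, item) for link in links]


def build_requests_fresh(links):
    """Fixed pattern: a new item is created for every story."""
    return [(link, {}) for link in links]


def fill(requests):
    """Mimic parse_story: write story-specific fields into the carried item."""
    results = []
    for link, item in requests:
        item['author'] = 'author of ' + link
        item['text'] = ['page 1 of ' + link]
        results.append(item)
    return results


links = ['story-a', 'story-b', 'story-c']
shared = fill(build_requests_shared(links))
fresh = fill(build_requests_fresh(links))

# With the shared item, all three results are the same dict, so the last
# write wins everywhere; with fresh items each story keeps its own data.
print([i['author'] for i in shared])
print([i['author'] for i in fresh])
```

In the real spider the symptom is less uniform than here, because responses arrive in an unpredictable order and each one overwrites or extends the single shared item at a different moment.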

python scrapy: collecting data from several pages into one item (dictionary)
