Question

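I have two spiders that consume the URLs and data scraped by a main spider. My approach was to use CrawlerProcess inside the main spider and pass the data on to the two spiders. Here is what I have: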

class LightnovelSpider(scrapy.Spider):

    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self,novels = []):
        self.novels = novels

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            request = scrapy.Request(novel, callback=self.parseNovel)
            yield request

    def parseNovel(self, response):
        #stuff here

class chapterSpider(scrapy.Spider):
    name = "chapters"
    #not done here

class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []


    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url,callback=self.parse)

    def parse(self,response):

        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            initCrawler.fromScraper.append(novel)

        self.checkchanged()

    def checkchanged(self):
        #some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        process = CrawlerProcess()
        novelSpider = LightnovelSpider()
        process.crawl(novelSpider,novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")

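I run "scrapy crawl main" and get a beautiful error.

The main error I can see is "twisted.internet.error.ReactorAlreadyRunning", and I have no idea what to do about it. Is there a better way to run multiple spiders from another spider, and/or how can I stop this error?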

Answer 1

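Wow, I didn't know something like this could work, but then I have never tried it.

When multiple scraping stages have to work hand in hand, I use one of these two options:

Option 1 - Use a database

When the scrapers have to run continuously, rescanning sites and so on, I have them push their results into a database (via a pipeline).

The spiders that do the follow-up processing then pull the data they need from the same database (in your case, for example, the novel URLs).

Then keep everything running with a scheduler or cron, and the spiders will work hand in hand.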
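A minimal sketch of how option 1 could look, assuming SQLite as a stand-in for whatever database is actually used and items shaped like {"url": ...}. The class name NovelUrlPipeline, the helper load_novel_urls, the table layout and the file name novels.db are all hypothetical; only the open_spider / process_item / close_spider hook names come from Scrapy's item pipeline interface.

```python
import sqlite3


class NovelUrlPipeline:
    """Item pipeline sketch: stores each scraped novel URL in SQLite."""

    def __init__(self, db_path="novels.db"):  # hypothetical file name
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS novels (url TEXT PRIMARY KEY)"
        )

    def process_item(self, item, spider):
        # INSERT OR IGNORE keeps rescans from creating duplicate rows.
        self.conn.execute(
            "INSERT OR IGNORE INTO novels (url) VALUES (?)", (item["url"],)
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        self.conn.close()


def load_novel_urls(db_path="novels.db"):
    """Return the stored URLs, sorted; a follow-up spider could call this
    at the top of its start_requests."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT url FROM novels ORDER BY url")
        return [row[0] for row in rows]
    finally:
        conn.close()
```

The first spider would enable the pipeline through ITEM_PIPELINES in its settings, and the second spider would loop over load_novel_urls() to build its requests, so neither spider ever needs to start the other.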

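Option 2 - Merge everything into one spider

When I need to run everything as a single script, this is the way I go: I create one spider that chains the multiple request steps together.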

import scrapy


class LightnovelSpider(scrapy.Spider):

    name = "novels"
    allowed_domains = ["readlightnovel.com"]

    # was initCrawler.start_requests
    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_novel_list)

    # a mix of initCrawler.parse and parts of LightnovelSpider.start_requests
    def parse_novel_list(self, response):
        # pick the links whose href is not "#", then take their href values
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a[not(@href="#")]/@href').extract():
            # urljoin also copes with relative links
            yield scrapy.Request(response.urljoin(novel), callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        # ... and create requests with callback=self.parse_chapters
        pass

    def parse_chapters(self, response):
        # do stuff
        pass

(The code is untested and only meant to show the basic idea.)

If things get too complex, I pull some of the pieces out and move them into mixin classes.
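A rough idea of what such mixins might look like. This is only a sketch: the mixin names and the XPath expressions are placeholders, not taken from the real site. The callbacks only touch the response object (response.xpath and response.follow), so the mixins themselves do not need to import Scrapy.

```python
class ChapterParsingMixin:
    """Callbacks for chapter pages; the XPath below is a placeholder."""

    def parse_chapters(self, response):
        yield {
            "url": response.url,
            "title": response.xpath("//h1/text()").get(),
        }


class NovelParsingMixin:
    """Callbacks for novel pages; relies on parse_chapters being provided
    by another mixin or by the spider itself."""

    def parse_novel(self, response):
        # Placeholder XPath for the chapter links on a novel page.
        for href in response.xpath('//a[@class="chapter"]/@href').getall():
            # response.follow resolves relative links and builds the request.
            yield response.follow(href, callback=self.parse_chapters)


# With Scrapy installed, the spider then only wires the pieces together:
#
#   class LightnovelSpider(NovelParsingMixin, ChapterParsingMixin, scrapy.Spider):
#       name = "novels"
#       ...
```

Listing the mixins before scrapy.Spider in the base classes keeps their callbacks ahead of anything the base class defines.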

In your case, I would most likely prefer option 2.

Answer 2

After some research, I was able to solve this by using the property decorator "@property" to retrieve the data from the main spider, like this:

class initCrawler(scrapy.Spider):

    #stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter
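As a side note, here is a small Scrapy-free sketch of how these accessors behave, assuming toNovel stays a class-level list that is only mutated in place, as in the question. The class name MainSpiderSketch and the example URL are made up for illustration.

```python
class MainSpiderSketch:
    # Class-level list, shared by the class and all of its instances,
    # mirroring initCrawler.toNovel from the question.
    toNovel = []

    @property
    def getNovel(self):
        return self.toNovel


spider = MainSpiderSketch()
# Hypothetical URL, appended the same way the question appends to its lists.
MainSpiderSketch.toNovel.append("http://www.readlightnovel.com/some-novel")

via_instance = spider.getNovel                  # the shared list
via_class_attr = MainSpiderSketch.toNovel       # the very same list object
via_class_property = MainSpiderSketch.getNovel  # a property object, not the list
```

In other words, the property only returns data when read from a spider instance. A script that holds just the class can still read the class attribute directly, as long as the spider appended to that list rather than rebinding self.toNovel to a new one.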

Then use CrawlerRunner like this:

from spiders.lightnovel import chapterSpider, LightnovelSpider, initCrawler
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging

configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # run the main spider first; it fills the class-level containers
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    # then run the dependent spiders one after another
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(LightnovelSpider, novels=toNovel)

    reactor.stop()

crawl()
reactor.run()  # blocks until reactor.stop() is called

Scrapy: running multiple spiders from a main spider?
