How do I schedule a Scrapy spider to crawl again after a certain amount of time?

Time: 2016-06-19 05:18:30

Tags: scrapy scrapy-spider

I want to schedule my spider to run again 1 hour after it finishes crawling. In my code the spider_closed method is called after the crawl ends. How can I run the spider again from that method? Or is there any built-in Scrapy setting for scheduling a spider?

Here is my basic spider code.

import scrapy
import codecs
from a2i.items import A2iItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher


class A2iSpider(scrapy.Spider):
    name = "notice"
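    # start_urls is loaded from urls.txt when the class body is executed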
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
    allowed_domains = ["prothom-alo.com"]

    def __init__(self):
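        # Register spider_closed() to run when the spider_closed signal fires, i.e. when the crawl ends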
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse(self, response):
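        # Extract every link from the start pages and request each one with parse_page as the callback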

        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            print "*"*70
            print url
            print "\n\n"
            yield scrapy.Request(url, callback=self.parse_page,meta={'depth':2,'url' : url})


    def parse_page(self, response):
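        # Append the depth and URL of the current page to response.txt, then follow its links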
        filename = "response.txt"
        depth = response.meta['depth']

        with open(filename, 'a') as f:
            f.write(str(depth))
            f.write("\n")
            f.write(response.meta['url'])
            f.write("\n")

        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_page,meta={'depth':depth+1,'url' : url})


    def spider_closed(self, spider):
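        # Called once the crawl has finished (this is where a re-run would have to be triggered)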
        print "$"*2000

2 answers:

Answer 0 (score: 1)

You can use cron.

Run crontab -e as root to create the schedule and run the script, or crontab -u [user] -e to run it as a specific user.

At the bottom you can add the following line:

0 * * * * cd /path/to/your/scrapy && scrapy crawl [yourScrapy] >> /path/to/log/scrapy_log.log

The 0 * * * * part makes the command run once every hour; you can find more details on the crontab fields online.
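
If you would rather have cron invoke a small Python script instead of the scrapy CLI, a minimal runner sketch could look like the one below; it uses Scrapy's standard CrawlerProcess API, and the file name run_notice.py is an assumption:

# run_notice.py -- minimal sketch of a runner script that the cron entry could call hourly.
# Assumes it is run from inside the Scrapy project so get_project_settings() finds the spiders.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())
    process.crawl("notice")  # look up the spider by its name attribute
    process.start()          # blocks until the crawl has finished

The crontab line would then point at this script instead, for example 0 * * * * cd /path/to/your/scrapy && python run_notice.py >> /path/to/log/scrapy_log.log.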

Answer 1 (score: 0)

You can run your spider with the JOBDIR setting; it will persist the requests that have been loaded into the scheduler.

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

https://doc.scrapy.org/en/latest/topics/jobs.html
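
As a small sketch of how the same setting can also be declared on the spider class itself rather than passed with -s on the command line (the directory name crawls/notice-1 is an arbitrary choice):

import scrapy

class NoticeSpider(scrapy.Spider):
    name = "notice"
    # Persist the scheduler queue and dupefilter state on disk so an
    # interrupted crawl can later be resumed with the same JOBDIR.
    custom_settings = {"JOBDIR": "crawls/notice-1"}

    def parse(self, response):
        pass  # crawling logic as in the question

Note that JOBDIR only persists crawl state for pausing and resuming; launching the next run an hour later would still be handled externally, for example with the cron entry from the first answer.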