Conflict when generating start URLs

Date: 2016-04-29 19:52:47

Tags: scrapy, scrapy-spider

I am working on retrieving information from the National Gallery of Art's online catalog. Because of the catalog's structure, I cannot navigate by extracting and following links from entry to entry. Fortunately, each object in the collection has a predictable URL, so I want my spider to navigate the collection by generating its start URLs.

I tried to solve my problem by implementing the solution from this thread. Unfortunately, it seems to break another part of my spider. The error log shows that my URLs are generated successfully, but they are not processed correctly. If I am interpreting the log correctly, and I suspect I am not, there is a conflict between the redefinition of start_urls that lets me generate the URLs I need and the rules section of the spider. As things stand, the spider also does not respect the number of pages I asked it to crawl.

My spider and a typical error are shown below. I appreciate any help you can offer.

Spider:

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 10
class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [URL % starting_number]
    rules = (
            Rule(LinkExtractor(allow=('art-object-page.*','objects/*')), callback='parse_CatalogRecord',
                 follow=True),
    )

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        for i in range (self.page_number, number_of_pages, -1):
            yield Request(url = URL % i + ".html" , callback=self.parse)


    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if r.search(response.body_as_unicode()):

            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')

            return CatalogRecord.load_item()
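
(For context, the project's items.py is not included in the post. A minimal item definition consistent with the fields loaded above would look something like the sketch below; the exact NgamedallionsItem definition is an assumption based on the loader calls and the log output.)

import scrapy

class NgamedallionsItem(scrapy.Item):
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    image_urls = scrapy.Field()  # image URLs consumed by the ImagesPipeline
    images = scrapy.Field()      # download results written back by the ImagesPipeline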

Typical error:

2016-04-29 15:35:00 [scrapy] ERROR: Spider error processing <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1178.html> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 73, in _parse_response
    for request_or_item in self._requests_to_follow(response):
   File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 51,  in _requests_to_follow
    for n, rule in enumerate(self._rules):
AttributeError: 'NGASpider' object has no attribute '_rules'

Update in response to eLRuLL's solution

Simply removing def __init__ and start_urls does let my spider crawl the generated URLs. However, it also seems to stop 'def parse_CatalogRecord(self, response)' from being applied. When I run the spider now, it only scrapes pages from outside the generated URL range. My revised spider and the log output are shown below.

Spider:

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 1311
class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    rules = (
            Rule(LinkExtractor(allow=('art-object-page.*','objects/*')), callback='parse_CatalogRecord',
                 follow=True),
    )

    def start_requests(self):
        self.page_number = starting_number
        for i in range (self.page_number, number_of_pages, -1):
            yield Request(url = URL % i + ".html" , callback=self.parse)


    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if r.search(response.body_as_unicode()):

            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')

            return CatalogRecord.load_item()

Log:

2016-05-02 15:50:02 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-02 15:50:02 [scrapy] INFO: Optional features available: ssl, http11
2016-05-02 15:50:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-05-02 15:50:02 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-02 15:50:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-02 15:50:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-02 15:50:02 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-05-02 15:50:02 [scrapy] INFO: Spider opened
2016-05-02 15:50:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-02 15:50:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-02 15:50:02 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-05-02 15:50:02 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-05-02 15:50:05 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-05-02 15:50:05 [scrapy] DEBUG: File (uptodate): Downloaded image from <GET http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg> referred in <None>
2016-05-02 15:50:05 [scrapy] DEBUG: Scraped from <200 http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html>
{'accession': u'1942.9.163.b',
 'image_urls': [u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'],
 'images': [{'checksum': '9d5f2e30230aeec1582ca087bcde6bfa',
             'path': 'full/3a692347183d26ffefe9ba0af80b0b6bf247fae5.jpg',
             'url': 'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'}],
 'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
 'title': u'House between Two Hills [reverse]'}
2016-05-02 15:50:05 [scrapy] INFO: Closing spider (finished)
2016-05-02 15:50:05 [scrapy] INFO: Stored json feed (1 items) in: items.json
2016-05-02 15:50:05 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 631,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 26324,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'dupefilter/filtered': 3,
 'file_count': 1,
 'file_status_count/uptodate': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 5, 2, 19, 50, 5, 810570),
 'item_scraped_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 8,
 'request_depth_max': 2,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 5, 2, 19, 50, 2, 455508)}
2016-05-02 15:50:05 [scrapy] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 1)

Don't override the __init__ method if you aren't going to call super.
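
To illustrate, if you really do want per-instance state, a minimal sketch of an __init__ that would not break rule compilation (CrawlSpider's own __init__ is what builds self._rules) looks like this:

def __init__(self, *args, **kwargs):
    # Calling super lets CrawlSpider compile its rules into self._rules,
    # which is exactly the attribute the AttributeError above says is missing.
    super(NGASpider, self).__init__(*args, **kwargs)
    self.page_number = starting_number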

Also, you don't need start_urls for your spider to work if you are going to use start_requests.

Simply removing the def __init__ method should do it; start_urls does not need to exist.

UPDATE:

OK, my mistake. It looks like CrawlSpider needs the start_urls attribute, so just create it instead of using the start_requests method:

start_urls = [URL % i + '.html' for i in range (starting_number, number_of_pages, -1)]

and remove start_requests.
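
Putting the pieces together, a sketch of how the full spider might look with this change applied (the imports and the NgamedallionsItem module path are assumptions, since they are not shown above):

import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, Identity

from ngamedallions.items import NgamedallionsItem  # assumed module path

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 1311

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    # Build the start URLs up front; no __init__ or start_requests override,
    # so CrawlSpider's rule handling is left intact.
    start_urls = [URL % i + '.html' for i in range(starting_number, number_of_pages, -1)]
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
             callback='parse_CatalogRecord', follow=True),
    )

    def parse_CatalogRecord(self, response):
        # Unchanged from the question's version.
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = Identity()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        if r.search(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
            return CatalogRecord.load_item()

The spider is then run as usual, for example with scrapy crawl ngamedallions.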
