How do I handle connection or download errors in Scrapy?

Time: 2018-10-11 10:27:54

Tags: python-3.x scrapy twisted scrapy-spider

I am checking for (internet) connection errors in my spider.py with the following:

# Imports needed by the error handling below:
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

def handle_error(self, failure):
    if failure.check(DNSLookupError):   # or failure.check(UnknownHostError):
        request = failure.request
        self.logger.error('DNSLookupError on: %s', request.url)
        print("\nDNS Error! Please check your internet connection!\n")

    elif failure.check(HttpError):
        response = failure.value.response
        self.logger.error('HttpError on: %s', response.url)

    print('\nSpider closed because of Connection issues!\n')
    raise CloseSpider('Because of Connection issues!')
    ...

However, when the spider is running and the connection goes down, I still get a Traceback (most recent call last): message. I would like to get rid of this by handling the error and shutting the spider down cleanly.

The output I get is:

2018-10-11 12:52:15 [NewAds] ERROR: DNSLookupError on: https://x.com

DNS Error! Please check your internet connection!

2018-10-11 12:52:15 [scrapy.core.scraper] ERROR: Error downloading <GET https://x.com>

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.6/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)

twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: x.com.

From this you can note the following:

  1. I am able to handle the (first?) DNSLookupError partially, but...
  2. the spider does not seem to shut down fast enough, so it keeps trying to download the URL, which leads to another error (ERROR: Error downloading).
  3. That is probably what causes the second error: twisted.internet.error.DNSLookupError:

How can I handle the [scrapy.core.scraper] ERROR: Error downloading and make sure the spider is shut down properly?

(Or: how can I check for an internet connection when the spider starts up?)

3 Answers:

Answer 0 (score: 0)

OK, I have been trying to play nicely with Scrapy and to exit gracefully when there is no internet connection or some other error occurs. The result? I could not get it to work. Instead, I ended up just shutting down the whole interpreter, and all of its nasty deferred children, with os._exit(0), like this:

import os
import socket
#from scrapy.exceptions import CloseSpider
...
def check_connection(self):
    try:
        socket.create_connection(("www.google.com", 443))
        return True
    except:
        pass
    return False

def start_requests(self):
    if not self.check_connection(): 
        print('Connection Lost! Please check your internet connection!', flush=True)
        os._exit(0)                     # Kill Everything
        #CloseSpider('Grace Me!')       # Close clean but expect deferred errors!
        #raise CloseSpider('No Grace')  # Raise Exception (w. Traceback)?!
    ...

That did it!
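As a side note (my own variant, not part of the original answer): socket.create_connection() also takes a timeout and returns a socket object, so the check can fail fast and clean up after itself. A minimal sketch, keeping the same Google host and port as an assumption:

import socket

def check_connection(self, host="www.google.com", port=443, timeout=5):
    # Open a TCP connection, close it immediately, and report success or failure.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False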


NOTE

I tried to shut down Scrapy with various internal methods, while also handling the nasty

[scrapy.core.scraper] ERROR: Error downloading

issue. It only seems to happen(?) when you use raise CloseSpider('Because of Connection issues!'), among many other attempts. It comes together with yet another twisted.internet.error.DNSLookupError, which does not seem to go away even though I had already handled it in my own code. Obviously, raise is the way to always raise an exception manually, so use CloseSpider() without it instead.


The problem at hand also seems to be a recurring one in the Scrapy framework... in fact, the source code has some FIXMEs in there about it. Even when I tried applying something like this:

# Assumed imports for this snippet:
from twisted.internet import defer
from scrapy import signals
from scrapy.utils.signal import disconnect_all

def stop(self):
    self.deferred = defer.Deferred()
    for name, signal in vars(signals).items():
        if not name.startswith('_'):
            disconnect_all(signal)
    self.deferred.callback(None)

and using these...

#self.stop()
#sys.exit()
#disconnect_all(signal, **kwargs)
#self.crawler.engine.close_spider(spider, 'cancelled')
#scrapy.crawler.CrawlerRunner.stop()
#crawler.signals.stop()

PS. It would be great if the Scrapy developers could document how best to handle such a simple case as having no internet connection.

Answer 1 (score: 0)

I believe I may have just found the answer. To exit gracefully from start_requests, return []. This signals that there are no requests to process.

To close the spider, call the close() method on the spider: self.close('reason')

import logging
import scrapy
import socket


class SpiderIndex(scrapy.Spider):
    name = 'test'

    def check_connection(self):
        try:
            socket.create_connection(("www.google.com", 443))
            return True
        except Exception:
            pass
        return False

    def start_requests(self):
        if not self.check_connection():
            print('Connection Lost! Please check your internet connection!', flush=True)
            self.close(self, 'Connection Lost!')
            return []

        # Continue as normal ...
        request = scrapy.Request(url='https://www.google.com', callback=self.parse)
        yield request

    def parse(self, response):
        self.log(f'===TEST SPIDER: PARSE REQUEST======{response.url}===========', logging.INFO)

Addendum: For some strange reason, while working on one spider I had to change self.close('reason') to self.close(self, 'reason').
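A possible explanation (an assumption based on the Scrapy versions I have looked at, not something stated in the answer): Spider.close is defined as a static method with the signature close(spider, reason), so even when it is called through self, the spider has to be passed explicitly. A rough illustration:

# Hypothetical illustration, assuming Spider.close is a staticmethod taking (spider, reason):
self.close(self, 'Connection Lost!')   # spider and reason both passed explicitly
self.close('Connection Lost!')         # the string is bound to 'spider', 'reason' is missing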

Answer 2 (score: 0)

twisted's defer has a similar issue: an exception gets caught after trying to close the Twisted connection, which stops the code from shutting down cleanly.

So, I just killed the core...

os._exit(0) 
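For context, a minimal sketch of how this could be wired into the question's errback (the spider name, URL, and handle_error callback are taken from the question; the hard exit is the only part this answer adds):

import os
import scrapy
from twisted.internet.error import DNSLookupError

class NewAdsSpider(scrapy.Spider):
    name = 'NewAds'                    # spider name as seen in the question's log output
    start_urls = ['https://x.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        self.logger.info('Downloaded %s', response.url)

    def handle_error(self, failure):
        if failure.check(DNSLookupError):
            self.logger.error('DNSLookupError on: %s', failure.request.url)
        # Hard-exit the interpreter instead of raising CloseSpider, so no further
        # deferreds run and no traceback is printed.
        os._exit(0)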