Question

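I have been running a crawler in Scrapy to crawl a large site that I'd rather not name. I used the tutorial spider as a template, then created a series of start requests and let it crawl from there, using something like this: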

def start_requests(self):
    # Read one ZIP code per line and yield a request for each directory page.
    with open('zipcodes.csv', 'r') as f:
        for line in f:
            zipcode = int(line)
            yield self.make_requests_from_url("http://www.example.com/directory/%05d" % zipcode)

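To begin with, there are more than 10,000 such pages. Each of those queues up a fairly large directory, from which there are several more pages to queue, and so on. Scrapy appears to prefer staying "shallow," accumulating requests that wait in memory instead of drilling down through them and then backing up.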

The result is a large, repeating exception that ends like this:

  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
    yield next(it)

.....(many more lines).....

  File "C:\Python27\lib\site-packages\scrapy\selector\lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
exceptions.MemoryError:

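Fairly quickly, within an hour or so of a crawl that should take several days, the Python executable balloons to 1.8 GB and Scrapy stops functioning (while continuing to cost me a lot of wasted dollars in proxy usage fees!).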

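Is there any way to get Scrapy to dequeue, externalize, or iterate through (I don't even know the right words) the stored requests to prevent a memory problem like this?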

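(I'm not very well versed in programming beyond piecing together what I see here or in the docs, so I'm not equipped to troubleshoot under the hood, so to speak. I was also unable to install the full 64-bit python/django/scrapy stack on W7, after days of trying and reading.)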

Answer 1

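When you recursively follow links across the entire Internet, you will never reach closure. You need to limit the recursion somehow. Unfortunately, you don't show the part of your code where you would do this. The simplest approach is to set a fixed maximum size for the list of pending links to crawl, and not add anything more to the list until it drops below that cap. A more advanced solution would assign each pending link a priority based on its surrounding context in the parent page, then keep the pending links in a sorted priority list with a fixed maximum size.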
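
As a rough illustration of the capped, prioritized list of pending links described above, here is a minimal sketch in plain Python. It is not part of Scrapy; the class name `BoundedFrontier`, its methods, and the convention that lower numbers are crawled first are all assumptions made for this example.

```python
import bisect
import itertools


class BoundedFrontier(object):
    """Hypothetical fixed-capacity list of pending links, sorted by priority."""

    def __init__(self, max_size):
        self.max_size = max_size
        # Sorted ascending by (priority, sequence number, url);
        # the first entry is crawled next, the last entry is the worst.
        self._entries = []
        self._seq = itertools.count()

    def push(self, url, priority):
        """Add a link; lower priority numbers are crawled first.

        Returns False if the frontier is full and the new link is not
        better than the worst link already stored.
        """
        entry = (priority, next(self._seq), url)
        if len(self._entries) >= self.max_size:
            if entry >= self._entries[-1]:
                return False
            self._entries.pop()  # evict the current worst link
        bisect.insort(self._entries, entry)
        return True

    def pop(self):
        """Remove and return the best pending URL."""
        return self._entries.pop(0)[2]

    def __len__(self):
        return len(self._entries)
```

With a cap of, say, a few thousand entries, memory use stays bounded no matter how many links the crawled pages contain; the trade-off is that rejected or evicted links are simply never visited.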

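However, rather than trying to edit or hack the existing code, you should first check whether the built-in settings can do what you want. See this documentation page for reference: http://doc.scrapy.org/en/latest/topics/settings.html. It looks like setting DEPTH_LIMIT to a value of 1 or greater will limit the depth of recursion from the start pages.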
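
For instance, the setting could go in the project's settings.py. The value 3 below is only an assumed example and would need to match how deep the directory pages actually go; in Scrapy, the default of 0 means no limit.

```python
# settings.py (the Scrapy project's settings module)
# Assumed example value: follow links at most 3 hops from the start URLs.
# The default, 0, imposes no depth limit.
DEPTH_LIMIT = 3
```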

Answer 2

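You can process your URLs in batches, queueing only a few each time the spider goes idle. This avoids having a large number of requests queued up in memory. The example below reads only the next batch of URLs from your database/file and queues them as requests once all previous requests have finished processing.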

More information on the spider_idle signal: http://doc.scrapy.org/en/latest/topics/signals.html#spider-idle

More information on debugging memory leaks: http://doc.scrapy.org/en/latest/topics/leaks.html

from scrapy import signals, Spider
from scrapy.xlib.pydispatch import dispatcher


class ExampleSpider(Spider):
    name = "example"
    start_urls = ['http://www.example.com/']

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # connect the function to the spider_idle signal
        dispatcher.connect(self.queue_more_requests, signals.spider_idle)

    def queue_more_requests(self, spider):
        # this function runs every time the spider is done processing
        # all requests/items (i.e. is idle)

        # get the next urls from your database/file
        urls = self.get_urls_from_somewhere()

        # if there are no more urls to be processed, do nothing and
        # the spider will finally close
        if not urls:
            return

        # iterate through the urls, create a request, then send them back to
        # the crawler, this will get the spider out of its idle state
        for url in urls:
            req = self.make_requests_from_url(url)
            self.crawler.engine.crawl(req, spider)

    def parse(self, response):
        pass
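
The spider above leaves get_urls_from_somewhere undefined. One possible way to fill it in, assuming the zipcodes.csv file from the question with one ZIP code per line, is sketched below. The helper names `iter_urls` and `next_batch` and the batch size of 100 are hypothetical choices for illustration, not part of Scrapy.

```python
import itertools


def iter_urls(lines, template="http://www.example.com/directory/%05d"):
    """Lazily yield one URL per non-empty line.

    `lines` can be any iterable of strings, such as an open file,
    so the whole file is never loaded into memory at once.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield template % int(line)


def next_batch(url_iter, batch_size=100):
    """Return up to batch_size URLs; an empty list means no URLs are left."""
    return list(itertools.islice(url_iter, batch_size))
```

In the spider, one could open the file once in __init__, keep `self.url_iter = iter_urls(open('zipcodes.csv'))`, and have get_urls_from_somewhere return `next_batch(self.url_iter)`. Each idle signal would then queue at most 100 new start URLs, and the empty list returned at the end of the file lets the spider close.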

Scrapy Memory Error (too many requests) Python 2.7
