Question

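I am developing an application in which I give Nutch a restricted set of URLs in the urls file. I am able to crawl these URLs and get their contents by reading the data from the segments.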

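I crawl with depth 1, because I do not care about outlinks or links inside the pages. I only need the contents of the pages listed in the urls file.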

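But this crawl takes a long time, so please suggest a way to reduce the crawl time and increase the crawl speed. I also do not need indexing, because I do not care about the search part.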

Does anyone have suggestions on how to speed up the crawl?

Answer 1

The main way to gain speed is to configure nutch-site.xml:

<property>
  <name>fetcher.threads.per.queue</name>
  <value>50</value>
  <description></description>
</property>

Answer 2

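You can scale up the threads in nutch-site.xml. Increasing both fetcher.threads.per.host and fetcher.threads.fetch will increase the crawl speed. I noticed a dramatic improvement. Be careful when increasing these, though: if you do not have the hardware or the connection to support the extra traffic, the number of errors in the crawl can rise significantly.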

Answer 3

For me, this property helped a lot, because a slow domain can slow down the whole fetch phase:

 <property>
  <name>generate.max.count</name>
  <value>50</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
 </property>

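For example, if you respect robots.txt (the default behaviour) and a domain takes too long to crawl, the delay can be as high as fetcher.max.crawl.delay. A lot of URLs from such a domain in one queue will slow down the whole fetch phase, so it is better to limit generate.max.count.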
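To see why the cap helps, here is a rough back-of-the-envelope sketch in Python. It is not Nutch code: the function names, the host names, and the 5-second delay are all assumptions for illustration. It also assumes one request at a time per host, ignores download time, and assumes there are enough fetcher threads for all queues to drain in parallel.

```python
def queue_seconds(urls_in_queue, delay_seconds):
    """Rough lower bound on the time needed to drain one host queue.

    Assumes one request at a time with delay_seconds between requests
    and ignores download time, so a real fetch takes longer.
    """
    return urls_in_queue * delay_seconds


def fetch_phase_seconds(queue_sizes, delay_seconds):
    """Queues drain in parallel, so the slowest queue sets the total."""
    return max(queue_seconds(n, delay_seconds) for n in queue_sizes.values())


def cap_queues(queue_sizes, max_count):
    """Mimic the effect of a per-host cap such as generate.max.count."""
    return {host: min(n, max_count) for host, n in queue_sizes.items()}


# Hypothetical fetchlist: one big slow host and two small ones.
sizes = {"slow.example": 5000, "a.example": 40, "b.example": 60}

print(fetch_phase_seconds(sizes, 5.0))                  # 25000.0 seconds
print(fetch_phase_seconds(cap_queues(sizes, 50), 5.0))  # 250.0 seconds
```

With the cap, the remaining URLs of the big host are simply left for later generate/fetch rounds instead of holding up this one.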

You can also add this property to limit the duration of the fetch phase:

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads fewer
  pages per second than the configured threshold, the fetcher stops, preventing slow queues
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

But please do not touch the fetcher.threads.per.queue property: you will end up on a blacklist. It is not a good way to improve crawl speed.

Answer 4

Hello, I am also new to crawling, but I have tried a few things and got some good results, so maybe you will too. I changed my nutch-site.xml with these properties:

<property>
  <name>fetcher.server.delay</name>
  <value>0.5</value>
  <description>The number of seconds the fetcher will delay between
    successive requests to the same server. Note that this might get
    overridden by a Crawl-Delay from a robots.txt and is used ONLY if
    fetcher.threads.per.queue is set to 1.
  </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>400</value>
  <description>The number of FetcherThreads the fetcher should use.
    This is also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>


<property>
  <name>fetcher.threads.per.host</name>
  <value>25</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a host at one time.</description>
</property>

Please suggest more options. Thanks.

Answer 5

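If you do not need to follow links, I see no reason to use Nutch. You can simply take your list of URLs and fetch them with an HTTP client library, or with a simple script that uses curl.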
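A minimal sketch of the simple-script approach, assuming Python 3 and only the standard library. The names `fetch_one` and `fetch_all` are made up for this example, and unlike Nutch it does not honour robots.txt or add any per-host delay, so keep `max_workers` low if many URLs point at the same server.

```python
import concurrent.futures
import urllib.request


def fetch_one(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_one, max_workers=8):
    """Fetch every URL in a thread pool.

    Returns a dict that maps each URL to its body (bytes), or to the
    exception that was raised while fetching it.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL should not stop the rest
                results[url] = exc
    return results
```

To use it, read your urls file into a list (one URL per line) and call `fetch_all(urls)`. The `fetch` parameter is only there so the download function can be swapped out, for example for a stub in tests.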

Answer 6

I had a similar problem and was able to improve the speed with the help of https://wiki.apache.org/nutch/OptimizingCrawls

It has useful information on what can slow down your crawl and what you can do to improve each of those issues.

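Unfortunately, in my case the queues are quite unbalanced and I cannot send requests to the bigger queues too fast, otherwise I get blocked. So I will probably need to move to a cluster solution or TOR before I can speed up the threads any further.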

How to speed up crawling in Nutch
