Limiting wget's wait time on unresponsive URLs

Date: 2018-02-08 09:58:18

Tags: bash web-scraping wget

How can I limit the time wget waits for a response from each URL?

This is my first time using bash, so please forgive me if this is a basic question. I'm trying to download pictures of quails from ImageNet to a virtual machine on Paperspace. The image URLs are listed here:

    http://image-net.org/api/text/imagenet.synset.geturls?wnid=n01804478

I'm using wget with the following command:

    wget -H -k -e robots=on -P ~/data/quails/train/cail_quail/ -i http://image-net.org/api/text/imagenet.synset.geturls?wnid=n01804478

I've found that at least one URL (possibly more) does not respond, and the download stalls waiting for a reply. I'd like to skip such URLs after a short time (say, 5 seconds).

An example of a URL I need to skip:

    http://images.encarta.msn.com/xrefmedia/sharemed/targets/images/pho/t049/T049952B.jpg

Thanks for any pointers.

1 answer:

Answer 0: (score: 2)

From the wget man page:

   -t number
   --tries=number
       Set number of tries to number. Specify 0 or inf for infinite retrying.  The default is to retry 20 times, with the exception of fatal errors like "connection refused" or "not found" (404), which are not retried.


   -T seconds
   --timeout=seconds
       Set the network timeout to seconds seconds.  This is equivalent to specifying --dns-timeout, --connect-timeout, and --read-timeout, all at the same time.

       When interacting with the network, Wget can check for timeout and abort the operation if it takes too long.  This prevents anomalies like hanging reads and infinite connects.  The only timeout enabled by default is a
       900-second read timeout.  Setting a timeout to 0 disables it altogether.  Unless you know what you are doing, it is best not to change the default timeout settings.

       All timeout-related options accept decimal values, as well as subsecond values.  For example, 0.1 seconds is a legal (though unwise) choice of timeout.  Subsecond timeouts are useful for checking server response times or
       for testing network latency.
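As the excerpt notes, `-T` is shorthand for setting `--dns-timeout`, `--connect-timeout`, and `--read-timeout` all at once. If you want finer control, the three can also be set individually; a sketch with illustrative values and a placeholder URL:

    wget --dns-timeout=5 --connect-timeout=5 --read-timeout=10 -t 1 "http://example.com/image.jpg"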

So you could do:

    wget -H -k -e robots=on -P ~/data/quails/train/cail_quail/ -i http://image-net.org/api/text/imagenet.synset.geturls?wnid=n01804478 -T 5 -t 1

to time out after 5 seconds without retrying.
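If you also want a record of which URLs were skipped, one alternative is to save the URL list locally and fetch each image in a loop, logging failures. A minimal bash sketch, assuming the list is saved as urls.txt and failed URLs go to skipped.txt (both file names are just for illustration):

    # Save the URL list to a local file first.
    wget -O urls.txt "http://image-net.org/api/text/imagenet.synset.geturls?wnid=n01804478"

    # Fetch each URL with a 5-second timeout and a single attempt;
    # log any URL that fails so it can be inspected later.
    while IFS= read -r url; do
        wget -T 5 -t 1 -P ~/data/quails/train/cail_quail/ "$url" \
            || echo "skipped: $url" >> skipped.txt
    done < urls.txt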