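I just want to crawl websites so that I can search their content and download files, so I am using wget to do the crawling. I know other tools can do this as well, but they are far more complex than I need. For example, I just want to crawl example.com and parse the site's content.
However, when I try this, some URLs return a 301 redirect, and wget does not seem to handle these redirects correctly. For example: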
wget -r https://example.com
--2020-02-27 21:23:50-- https://example.com/
Resolving example.com (example.com)... 104.31.69.85, 103.31.68.90
Connecting to example.com (example.com)|104.31.69.85|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.example.com/ [following]
--2020-02-27 21:23:51-- http://www.example.com/
Resolving www.example.com (www.example.com)... 103.31.68.90, 104.31.69.85
Connecting to www.example.com (www.example.com)|103.31.68.90|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.example.com/ [following]
--2020-02-27 21:23:51-- https://www.example.com/
Connecting to www.example.com (www.example.com)|103.31.68.90|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘example.com/index.html’
example.com/index.html [ <=> ] 129.68K --.-KB/s in 0.003s
2020-02-27 21:23:51 (45.9 MB/s) - ‘example.com/index.html’ saved [132797]
FINISHED --2020-02-27 21:23:51--
Total wall clock time: 0.3s
Downloaded: 1 files, 130K in 0.003s (45.9 MB/s)
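In the output above, example.com is just a placeholder for the real domain. wget does not seem to follow the redirect from example.com to www.example.com, and I don't want to have to put www in every URL, because some sites prefer and serve example.com rather than www.example.com.
Is there a way to make wget follow these redirects? I see that wget has an option related to this, but it does not even seem to follow the first redirect, so I am not sure how to work around the problem.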
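Reading the log again, both 301 responses are marked [following] and the final request returns 200 OK, so the redirects themselves may be working. A possible explanation, which I have not confirmed, is that the recursive crawl stops after one file because the links on the page point to www.example.com, which wget treats as a different host from the starting host example.com. Below is a minimal sketch of what could be tried under that assumption. The helper names base_domain and crawl are hypothetical; --span-hosts and --domains are existing wget options.

```shell
#!/bin/sh
# Hypothetical helper: strip a leading "www." so that both host forms
# (example.com and www.example.com) map to the same base domain.
base_domain() {
  printf '%s\n' "${1#www.}"
}

# Sketch only, assuming the crawl stops because of the host change:
# --span-hosts lets recursion leave the starting host, and
# --domains restricts it to hosts under the base domain.
crawl() {
  domain="$(base_domain "$1")"
  wget -r --span-hosts --domains="$domain" "https://$1/"
}
```

With these definitions, running crawl example.com would start at https://example.com/ and should be allowed to continue onto www.example.com after the redirect, while staying off unrelated domains.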