Question

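I want to keep a local copy of a gallery hosted on a website. The gallery shows its pictures at domain.com/id/1 (the id increases in increments of 1), and each image is stored at pics.domain.com/pics/original/image.format. The exact lines that contain the image in the HTML are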

<div id="bigwall" class="right"> 
    <img border=0 src='http://pics.domain.com/pics/original/image.jpg' name='pic' alt='' style='top: 0px; left: 0px; margin-top: 50px; height: 85%;'> 
</div>

So I want to write a script along these lines (in pseudocode):

for(id = 1; id <= 151468; id++) {
     page = "http://domain.com/id/" + id.toString();
     src = returnSrc(); // Searches the html for img with name='pic' and saves the image location as a string
     getImg(); // Downloads the file named in src
}

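I'm not sure exactly how to do this, though. I suppose I could do it in bash: use wget to download the HTML, search it for http://pics.domain.com/pics/original/*.*, use wget again to save the file, remove the HTML file, increment the id, and repeat. The only problem is that I'm not good at handling strings, so if anyone could tell me how to search for the URL and replace the *s with the file name and format, I should be able to get the rest going. Or, if my method is stupid and you have a better one, please share.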

Answer 1

# get all pages
curl 'http://domain.com/id/[1-151468]' -o '#1.html'

# get all image URLs (any format; stop at the quote that closes the src attribute)
grep -oh "http://pics\.domain\.com/pics/original/[^'\"]*" *.html >urls.txt

# download all images
sort -u urls.txt | wget -i-
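
If you would rather follow the loop from the question and avoid keeping 151468 HTML files on disk, something like the following might work. It is only a sketch: the function names extract_pic_src and mirror_gallery and the images output directory are made up for illustration, and the sed pattern assumes the whole <img> tag sits on one line with src before name='pic', as in the sample HTML above.

```shell
#!/bin/sh
# Sketch only: extract_pic_src, mirror_gallery and the "images" directory
# are hypothetical names, not part of the original answer.

# Read HTML on stdin and print the src of the first <img ... name='pic'> tag.
# Assumes the tag is on a single line and src comes before name.
extract_pic_src() {
    sed -n "s/.*<img[^>]*src=['\"]\([^'\"]*\)['\"][^>]*name=['\"]pic['\"].*/\1/p" | head -n 1
}

# Fetch pages first_id..last_id one at a time and download each image.
# The body runs in a subshell so its variables do not leak to the caller.
mirror_gallery() (
    id=$1
    last_id=$2
    while [ "$id" -le "$last_id" ]; do
        # Stream the page straight into the extractor; nothing is saved to disk.
        src=$(curl -fsS "http://domain.com/id/$id" | extract_pic_src)
        # Skip ids whose page is missing or has no matching image tag.
        if [ -n "$src" ]; then
            wget -nc -P images "$src"
        fi
        id=$((id + 1))
    done
)
```

Calling mirror_gallery 1 151468 would walk the whole range. Since wget -nc skips files that already exist, the loop can be restarted after an interruption without downloading everything again.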
