I tried to use ImagesPipeline to download images, but I only end up with a single image (the last one); see the screenshot:
My target site is https://sc.chinaz.com/tupian/
Here is my code:
# This is the spider:
import scrapy
from imgPro.items import ImgproItem
from PIL import Image


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="container"]/div')
        # print(div_list)
        # url = 'https://sc.chinaz.com'
        for div in div_list:
            img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
            print(img_src)
            item = ImgproItem()
            item['src'] = img_src
            yield item
This is my pipeline:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class imagePipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        imag_name = request.url.split('/')[-1]
        return imag_name

    def item_completed(self, results, item, info):
        return item
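ImagesPipeline only runs when it is enabled and given a storage directory, and a missing or misconfigured settings.py is a common reason downloads silently go missing. A sketch of the relevant settings, assuming the project module is named imgPro (mirroring the import in the spider above; adjust the dotted path to your own project layout):

```python
# settings.py (sketch; paths are assumptions based on the project name imgPro)

# Enable the custom pipeline; the number is its priority (lower runs first)
ITEM_PIPELINES = {
    'imgPro.pipelines.imagePipeLine': 300,
}

# Directory where ImagesPipeline writes the downloaded files
IMAGES_STORE = './imgs'
```

Note also that file_path() above names each file after the last URL segment, so two images whose URLs end in the same filename would overwrite each other on disk.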
What should I change to get all the images?

Answer 0 (score: 0)
In the for loop of parse() you iterate over the list of all images, but after the loop only the last one remains in img_src and the earlier ones are never returned. So you either need to process each image as soon as you obtain its img_src:
for div in div_list:
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    # now process this image
or save them all in a list and process the whole list later:
all_img_srcs = []
for div in div_list:
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    all_img_srcs.append(img_src)
# now process all the images on the list
Perhaps something like:
def parse(self, response):
    div_list = response.xpath('//*[@id="container"]/div')
    items = []
    for div in div_list:
        img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
        print(img_src)
        item = ImgproItem()
        item['src'] = img_src
        items.append(item)
    # Scrapy expects items one at a time, not a single list object,
    # so yield each collected item rather than the list itself
    yield from items
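The difference between yielding the list as one object and yielding its elements matters here: a callback that does `yield items` hands Scrapy a single value (the whole list), while `yield from items` emits one value per item. A minimal plain-Python sketch with hypothetical data (no Scrapy involved) showing the count of values each version produces:

```python
def parse_as_one_list(srcs):
    # Collect everything, then yield the list as a single object
    items = [{'src': s} for s in srcs]
    yield items

def parse_item_by_item(srcs):
    # Collect everything, then yield each element individually
    items = [{'src': s} for s in srcs]
    yield from items

srcs = ['//a.jpg', '//b.jpg', '//c.jpg']
print(len(list(parse_as_one_list(srcs))))   # the generator produced 1 value
print(len(list(parse_item_by_item(srcs))))  # the generator produced 3 values
```

This is why the pipeline only ever sees one "item" when the whole list is yielded at once.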