Question

我尝试使用ImagesPipeline下载图片，结果只能得到一张图片（最后一张）；看截图：

你可以查看我的代码：

#This is spider:

import scrapy
from imgPro.items import ImgproItem
from PIL import Image


class ImgSpider(scrapy.Spider):
    name = 'img'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://sc.chinaz.com/tupian/']


    def parse(self, response):
        div_list = response.xpath('//*[@id="container"]/div')
        # print(div_list)
        # url = 'https://sc.chinaz.com'

        for div in div_list:
            
            img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
            print(img_src)

        item = ImgproItem()
        item['src'] = img_src

        
        yield item

这是我的管道：

import scrapy

from scrapy.pipelines.images import ImagesPipeline

class imagePipeLine(ImagesPipeline):

    def get_media_requests(self, item, info):
       yield scrapy.Request(item['src'])

    def file_path(self, request, response=None, info=None, *, item=None):
        imag_name = request.url.split('/')[-1]
        return imag_name

    def item_completed(self, results, item, info):
        return item

我应该更改什么才能获得所有图像？

Answer 1

在 for 循环的 parse() 中，您遍历所有图像的列表，但在循环之后，只有最后一个保存在 img_src 中，并且永远不会返回到前一个。因此，您要么需要在获得各自的 img_src 后立即处理每个图像：

for div in div_list:        
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    # now process this image

或者将它们全部保存在一个列表中并稍后处理整个列表：

all_img_srcs = []
for div in div_list:
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
    print(img_src)
    all_img_srcs.append(img_src)

# now process all the images on the list

也许

def parse(self, response):
    div_list = response.xpath('//*[@id="container"]/div')

    items = []
    for div in div_list:            
        img_src = 'https:' + div.xpath('./div/a/img/@src2')[0].extract()
        print(img_src)
        item = ImgproItem()
        item['src'] = img_src
        items.append(item)   

    yield items

scrapy 只下载一张图片

1 个答案: