Question

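I created a spider with scrapy, and I'm trying to save the download links into a (Python) list so that I can later access list entries with downloadlist[1].

But scrapy saves the URLs as items rather than as a list. Is there a way to append each URL to a list?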

import scrapy
from scrapy.linkextractors import LinkExtractor


DOMAIN = 'some-domain.com'
URL = 'http://' + DOMAIN


linklist = []

class subtitles(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    # First parse returns all the links of the website and feeds them to parse2

    def parse(self, response):
        for url in response.xpath('//a/@href').getall():
            # Resolve relative links against the URL of the current page
            yield scrapy.Request(response.urljoin(url), callback=self.parse2)

    # Second parse selects only the links that contain "download"

    def parse2(self, response):
        le = LinkExtractor(allow=("download"))
        for link in le.extract_links(response):
            yield scrapy.Request(url=link.url, callback=self.parse2)
            print(link.url)

# prints list of urls, 'downloadlist' should be a list but isn't.

downloadlist = subtitles()
print(downloadlist)

Answer 1

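You are misunderstanding how classes work: you are calling a class here, not a function.

Think about it this way: the spider you define in class MySpider(Spider) is a template used by the scrapy engine. When you run scrapy crawl myspider, scrapy starts up an engine and reads your template to create an object that will be used to process the various responses.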
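To make the class-versus-function point concrete, here is a minimal plain-Python sketch. It does not use scrapy at all; TemplateSpider and the sample link strings are hypothetical stand-ins chosen only for illustration. It shows that calling a class just builds an object, and that results only appear once something runs the callback.

```python
class TemplateSpider:
    """Plain-Python stand-in for a spider class (hypothetical, not scrapy)."""

    def parse(self, links):
        # Generator method: yields one dict per matching link,
        # in the same style as a scrapy callback yielding items.
        for link in links:
            if "download" in link:
                yield {"url": link}


# Calling the class only builds an instance; nothing is parsed yet.
spider = TemplateSpider()
print(type(spider).__name__)

# Results exist only after the callback is actually run.
items = list(spider.parse(["/download/a.zip", "/about", "/download/b.zip"]))
downloadlist = [item["url"] for item in items]
print(downloadlist[1])
```

In a real crawl it is the scrapy engine, not your own code, that creates the spider object and calls its callbacks, which is why printing subtitles() shows an object rather than a list of URLs.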

So your idea here can simply be translated to:

def parse2(self, response):
    le = LinkExtractor(allow=("download"))
    for link in le.extract_links(response):
        # Yield each download link as an item instead of printing it
        yield {'url': link.url}

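If you run this with scrapy crawl myspider -o items.json, you'll get all of the download links in JSON format.
There's no reason to save the downloads to a list inside the spider, since the list would not be available outside the spider template (class) you wrote, so it would serve no purpose there.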
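If you still want a Python list afterwards, one option is to read the exported file back in a separate script. The sketch below assumes items.json is a single JSON array written by one crawl and that every item has the 'url' key from parse2 above; load_download_links is a hypothetical helper name, not part of scrapy.

```python
import json
from pathlib import Path


def load_download_links(path):
    """Return the 'url' value of every item in an exported JSON feed file."""
    # The file is assumed to hold one JSON array of item dicts.
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    # Skip any item that lacks a 'url' key rather than failing on it.
    return [item["url"] for item in items if "url" in item]


if __name__ == "__main__":
    feed = Path("items.json")  # assumed output path from the crawl command
    if feed.exists():
        downloadlist = load_download_links(feed)
        print(downloadlist)
```

After this, downloadlist is an ordinary list, so downloadlist[1] works as long as the crawl found at least two links.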

Parsing URL links into a list
