Setting scrapy start_urls from outside the class

Asked: 2019-06-03 23:17:22

Tags: python scrapy

I am new to Scrapy. How can I pass start_urls from outside the class? I tried building start_urls outside the class, but it didn't work. What I want to do is create a file whose filename is a key of a dictionary (search_dict), and use the corresponding value as the start URL for Scrapy:

import csv

import scrapy
from scrapy.crawler import CrawlerProcess

search_dict={'hello world':'https://www.google.com/search?q=hello+world',
            'my code':'https://www.google.com/search?q=stackoverflow+questions',
            'test':'https://www.google.com/search?q="test"'}

class googlescraper(scrapy.Spider):
    name = "test"
    allowed_domains = ["google.com"]
    #start_urls= ??
    found_items = []
    def parse(self, response):
        item=dict()
        #code here
        self.found_items.append(item)

for k,v in search_dict.items():
    with open(k,'w') as csvfile:
        process = CrawlerProcess({
            'DOWNLOAD_DELAY': 0,
            'LOG_LEVEL': 'DEBUG',
            'DOWNLOAD_TIMEOUT':30,})
        process.crawl(googlescraper) #scrapy spider needs start url
        spider = next(iter(process.crawlers)).spider
        process.start()
        dict_writer = csv.DictWriter(csvfile, keys)
        dict_writer.writeheader()
        dict_writer.writerows(spider.found_items)
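As an aside, `keys` is never defined in the snippet above; the CSV fieldnames have to come from somewhere. A stdlib-only sketch of that writing step, with hypothetical item dicts standing in for the spider's found_items and an in-memory buffer standing in for the open file:

```python
import csv
import io

# Hypothetical scraped items standing in for spider.found_items
found_items = [
    {"title": "hello world", "rank": 1},
    {"title": "my code", "rank": 2},
]

# Derive the fieldnames from the first item rather than
# referencing an undefined `keys` variable
keys = list(found_items[0].keys())

buffer = io.StringIO()  # stands in for the open CSV file
dict_writer = csv.DictWriter(buffer, fieldnames=keys)
dict_writer.writeheader()
dict_writer.writerows(found_items)

print(buffer.getvalue())
```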

1 Answer:

Answer 0 (score: 2)

The Scrapy documentation has an example of instantiating a crawler with arguments: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments

You can pass your URLs like this:

# ...

class GoogleScraper(scrapy.Spider):
    # ...
    # Omit `start_urls` in the class definition
    # ...

process.crawl(GoogleScraper, start_urls=[
    # The URL you want to pass here
])

The kwargs in the call to process.crawl() are passed on to the spider's initializer. The default initializer copies any kwargs onto the spider instance as attributes, so this is equivalent to setting start_urls in the class definition.
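That attribute-copying behaviour can be illustrated without running a crawl. The class below is a simplified stand-in for scrapy.Spider's default initializer, not Scrapy's actual code, just the same idea:

```python
class FakeSpider:
    """Mimics how scrapy.Spider's default __init__ copies
    keyword arguments onto the instance as attributes."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        # Every kwarg becomes an instance attribute
        for key, value in kwargs.items():
            setattr(self, key, value)


# Equivalent in spirit to process.crawl(GoogleScraper, start_urls=[...]):
spider = FakeSpider(name="test",
                    start_urls=["https://www.google.com/search?q=hello+world"])
print(spider.start_urls)
```

After construction, `spider.start_urls` holds the list that was passed in, just as if it had been declared on the class.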

The relevant section of the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess.crawl