I'm new to Scrapy. How can I pass start_urls from outside the class? I tried setting start_urls outside the class, but it didn't work. What I want to do is create a file named after each key of a dictionary (search_dict) and use the corresponding value as the start URL for Scrapy:
search_dict = {'hello world': 'https://www.google.com/search?q=hello+world',
               'my code': 'https://www.google.com/search?q=stackoverflow+questions',
               'test': 'https://www.google.com/search?q="test"'}
class googlescraper(scrapy.Spider):
    name = "test"
    allowed_domains = ["google.com"]
    # start_urls = ??
    found_items = []

    def parse(self, response):
        item = dict()
        # code here
        self.found_items.append(item)
for k, v in search_dict.items():
    with open(k, 'w') as csvfile:
        process = CrawlerProcess({
            'DOWNLOAD_DELAY': 0,
            'LOG_LEVEL': 'DEBUG',
            'DOWNLOAD_TIMEOUT': 30,
        })
        process.crawl(googlescraper)  # scrapy spider needs start url
        spider = next(iter(process.crawlers)).spider
        process.start()
        dict_writer = csv.DictWriter(csvfile, keys)
        dict_writer.writeheader()
        dict_writer.writerows(spider.found_items)
Answer 0 (score: 2)
The Scrapy documentation has an example of instantiating a crawler with arguments: https://docs.scrapy.org/en/latest/topics/spiders.html#spider-arguments
You can pass your URLs like this:
# ...
class GoogleScraper(scrapy.Spider):
    # ...
    # Omit `start_urls` in the class definition
    # ...

process.crawl(GoogleScraper, start_urls=[
    # The URL you want to pass here
])
The kwargs in the call to process.crawl() are passed on to the spider's initializer. The default initializer copies any kwargs onto the spider instance as attributes, so this is equivalent to setting start_urls in the class definition.
Relevant section of the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess.crawl
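To see why the kwargs end up as attributes, here is a minimal pure-Python sketch of what Scrapy's default spider initializer does with keyword arguments (MiniSpider is a hypothetical stand-in used for illustration, not the real scrapy.Spider class; no Scrapy install is needed to run it):

```python
class MiniSpider:
    # Mimics the behavior of scrapy.Spider's default initializer,
    # which copies keyword arguments onto the instance as attributes.
    name = "test"

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

search_dict = {'hello world': 'https://www.google.com/search?q=hello+world'}

# Passing start_urls at construction time is equivalent to
# defining start_urls in the class body.
spider = MiniSpider(start_urls=list(search_dict.values()))
print(spider.start_urls)
# → ['https://www.google.com/search?q=hello+world']
```

This is why process.crawl(GoogleScraper, start_urls=[...]) works: the extra keyword arguments flow through to the spider's initializer and become instance attributes before crawling starts.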