Using Scrapy, I want to first collect URLs from some pages, and then parse each URL found and yield an item from it.
For example, the code looks like this:
def parse(self, response):
    # collect urls first
    urls = self.collect_urls(response)
    # parse urls found
    for url in urls:
        self.parse_url(url)  # will yield Item inside

def collect_urls(self, response):
    urls = response.meta.get('urls')
    if urls is None:
        urls = set()
    # do some logic of collecting urls from response into the urls set
    # ...
    if is_still_has_data(response):
        # continue collecting urls on another page
        yield scrapy.FormRequest(response.url, formdata={'dummy': 'dummy1'},
                                 meta={'urls': urls}, callback=self.collect_urls)
    else:
        return urls  # error here
The problem is that I cannot return an object from a function that also uses yield (a return with a value inside a generator is a SyntaxError in Python 2). So I made urls a class attribute/member instead, like this:
urls = set()

def parse(self, response):
    # collect urls first
    yield self.collect_urls(response)
    # parse urls found
    for url in urls:
        self.parse_url(url)  # will yield Item inside

def collect_urls(self, response):
    # do some logic of collecting urls from response into the urls set
    # ...
    if is_still_has_data(response):
        # continue collecting urls on another page
        return scrapy.FormRequest(response.url, formdata={'dummy': 'dummy1'},
                                  callback=self.collect_urls)
The problem with this code is that after the call yield self.collect_urls(response), execution immediately continues with the for url in urls: part instead of waiting for the collect_urls function to finish. If I remove the yield, then collect_urls is only called once and the callback in the FormRequest never fires; it seems the callback only works when the FormRequest is actually yielded. I know one solution is to move the for url in urls: part into the collect_urls function, but I would like to know whether the code pattern I want is achievable in Scrapy.
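For context, the usual Scrapy idiom for "parse each URL found" is to yield a new Request per URL and let the engine invoke the callback, rather than calling the callback function directly. Below is a minimal sketch of that idiom, assuming each collected URL still needs to be downloaded; the spider name, selector, and item are made up for illustration and are not from the question:

import scrapy

class ExampleSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = 'example'
    start_urls = ['http://example.com/list']

    def parse(self, response):
        # assume here that all the urls are available on a single page
        urls = response.css('a::attr(href)').extract()
        for url in urls:
            # yielding a Request is what makes Scrapy call parse_url later;
            # calling self.parse_url(url) directly never reaches the engine
            yield scrapy.Request(response.urljoin(url), callback=self.parse_url)

    def parse_url(self, response):
        yield {'url': response.url}  # placeholder item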
Answer 0 (score: 0)
If you have a function that yields something, you have basically turned it into a Python generator, and you can no longer return a value from it.
However, even though you cannot return a list of items after having yielded a request or an item, if you have a sequence you want to emit, you can simply iterate over it and yield each element:
def some_callback(self, response):
    # ... yield something here
    requests = get_next_requests_list(response)
    # can't return the requests list, so we iterate and yield:
    for req in requests:
        yield req
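On Python 3 the same loop can be written more compactly with yield from; this is just shorthand for the loop above, not part of the original answer:

def some_callback(self, response):
    # ... yield something here
    yield from get_next_requests_list(response)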
Also, Scrapy only follows requests and collects items that are yielded by the callback itself. So if you want to trigger another callback from inside a callback, you have to iterate over the result of calling it and yield the values:
def some_callback(self, response):
    # ... do stuff here, yields a few items or requests
    for rr in self.another_callback(response):
        yield rr
I hope this helps with your problem.
Answer 1 (score: 0)
After some experimenting, I don't think this code pattern can be done, because a request's callback cannot hand control back to the original caller/yielder of the request.
The workaround I ended up with is to keep chaining the callback until no more URLs are found, and only then parse each URL that was collected:
def parse(self, response):
    urls = response.meta.get('urls')
    if urls is None:
        urls = set()
    # do some logic of collecting urls from response into the urls set
    # ...
    if is_still_has_data(response):
        # continue collecting urls on another page
        return scrapy.FormRequest(response.url, formdata={'dummy': 'dummy1'},
                                  meta={'urls': urls}, callback=self.parse)
    else:
        return self.do_loop_urls(urls)

def do_loop_urls(self, urls):
    # parse urls found
    for url in urls:
        yield self.parse_url(url)  # will yield Item inside
Assuming there are 3 pages, the flow looks like this:
parse -> parse -> parse -> do_loop_urls