Is it possible to implement this code pattern in Scrapy?

Asked: 2014-12-25 19:25:36

Tags: python web-scraping scrapy

Using Scrapy, I want to first collect URLs from some pages, and then parse every URL found and yield the Item for it.

For example, the code looks like this:

def parse(self, response):
    # collect urls first
    urls = self.collect_urls(response)

    # parse urls found
    for url in urls:
        self.parse_url(url) # will yield Item inside


def collect_urls(self, response):
    urls = response.meta.get('urls')
    if urls is None:
        urls = set()

    # do some logic of collecting urls from response into urls set
    # ...

    if is_still_has_data(response):
        # continue collecting urls in other page
        yield scrapy.FormRequest(response.url, formdata={'dummy':'dummy1'}, 
            meta={'urls': urls}, callback=self.collect_urls)
    else:
        return urls     # error here

The problem is that I cannot return an object from a function that contains a yield.
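
(My illustration, not from the original post.) On Python 2, which Scrapy targeted at the time, a return with a value inside a generator is a SyntaxError outright; on Python 3.3+ it is legal, but the value only ends up in StopIteration.value, so normal iteration, and therefore Scrapy, never sees it:

def collect():
    yield 'http://example.com/page1'      # iteration produces this
    return {'http://example.com/page2'}   # SyntaxError on Python 2; legal on 3.3+
                                          # but only stored on StopIteration.value

print(list(collect()))                    # ['http://example.com/page1'] -- the returned set is ignored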

I then tried making urls a class attribute/member, like this:

urls = set()

def parse(self, response):
    # collect urls first
    yield self.collect_urls(response)

    # parse urls found
    for url in self.urls:
        self.parse_url(url) # will yield Item inside


def collect_urls(self, response):
    # do some logic of collecting urls from response into the self.urls set
    # ...

    if is_still_has_data(response):
        # continue collecting urls in other page
        return scrapy.FormRequest(response.url, formdata={'dummy':'dummy1'}, 
            callback=self.collect_urls)

The problem with this code is that after yield self.collect_urls(response) it continues straight into the for url in self.urls: part instead of waiting for the collect_urls function to finish. If I remove the yield, collect_urls is only called once and the callback in the FormRequest does not fire; it seems the callback only works when the FormRequest is actually yielded.
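
(My illustration again.) This is standard generator behavior: calling a generator function only creates a generator object without running any of its body, so yield self.collect_urls(response) hands Scrapy an unstarted generator, which is neither a Request nor an Item, and parse immediately moves on to the for loop:

def collect_urls(response):
    print('collecting...')    # not printed by the call below
    yield 'http://example.com'

gen = collect_urls(None)      # nothing in the body has executed yet
urls = list(gen)              # only now does the body run and 'collecting...' print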

I know there is a solution of moving the for loop over the URLs into the collect_urls function, but I want to know whether the code pattern I want is possible in Scrapy.

2 answers:

Answer 0 (score: 0):

Once a function yields something, you have basically turned it into a Python generator, and you can no longer return a value from it.

However, even though you cannot return a list of items once you have yielded a request or an item, if you have a sequence you want to return, you can simply iterate over it and yield each element:

def some_callback(self, response):
    # ... yield something here

    requests = get_next_requests_list(response)

    # can't return requests list, so we iterate and yield:
    for req in requests:
        yield req

Also, Scrapy only follows requests and collects items that are yielded from the callback itself. So, if you want to trigger a callback from within another callback, you have to iterate over the result of calling it and yield everything it produces:

def some_callback(self, response):
    # ... do stuff here, yields a few items or requests

    for rr in self.another_callback(response):
        yield rr
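
As a side note (my addition; it assumes Python 3.3+, which postdates this answer), both loops above can be written more compactly with yield from, which delegates to the inner iterable:

def some_callback(self, response):
    # ... do stuff here, yields a few items or requests

    yield from get_next_requests_list(response)   # same as the first loop
    yield from self.another_callback(response)    # same as the second loop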

I hope this helps with your problem.

Answer 1 (score: 0):

After some experimenting, I think this code pattern cannot be done, because a request's callback cannot hand control back to the original caller/yielder of the request.

One solution I could do is to chain the callback to itself until no more URLs are found, and only then loop over every URL that was collected:

def parse(self, response):
    urls = response.meta.get('urls')
    if urls is None:
        urls = set()

    # do some logic of collecting urls from response into urls set
    # ...

    if is_still_has_data(response):
        # continue collecting urls in other page
        return scrapy.FormRequest(response.url, formdata={'dummy':'dummy1'}, 
            meta={'urls': urls}, callback=self.parse)
    else:
        return self.do_loop_urls(urls)


def do_loop_urls(self, urls):
    # parse urls found
    for url in urls:
        yield self.parse_url(url) # will yield Item inside

Assuming there are 3 pages, the flow looks like this:

parse -> parse -> parse -> do_loop_urls
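
The post never shows parse_url itself. One hypothetical way to complete the pattern (PageItem and parse_item are my own illustrative names) is to have parse_url wrap each collected URL in a Request whose callback then yields the Item, which matches the yield self.parse_url(url) call above:

import scrapy

class PageItem(scrapy.Item):
    # hypothetical item with a single field; a real spider would define more
    url = scrapy.Field()

def parse_url(self, url):
    # turn a collected URL into a Request; the Item is produced by parse_item
    # once the page has been downloaded
    return scrapy.Request(url, callback=self.parse_item)

def parse_item(self, response):
    yield PageItem(url=response.url)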