Scraping Pinboard recursively with Scrapy - "Spider must return Request" error

Date: 2017-08-20 06:42:00

Tags: python recursion scrapy yield

To sharpen my Python and Spark GraphX skills, I have been trying to build a graph of Pinboard users and bookmarks. To do this, I scrape Pinboard bookmarks recursively in the following way:

  1. Start with a user and scrape all of their bookmarks.
  2. For each bookmark, identified by its url_slug, find all users who have saved the same bookmark.
  3. For each user found in step 2, repeat the process (go to 1, ...).

Despite trying suggestions from several threads (including the use of Rules), I get the following error when I try to implement this logic:


    ERROR: Spider must return Request, BaseItem, dict or None, got 'generator'

I strongly suspect that I am mixing up yield / return somewhere in my code.

Here is a short description of my code:

My main parse method finds all bookmark items of a user (also following any pages with the same user's earlier bookmarks) and yields the parse_bookmark method to scrape those bookmarks:

    class PinSpider(scrapy.Spider):
        name = 'pinboard'
    
        # Before = datetime after 1970-01-01 in seconds, used to separate the bookmark pages of a user
        def __init__(self, user='notiv', before='3000000000', *args, **kwargs):
            super(PinSpider, self).__init__(*args, **kwargs)
            self.start_urls = ['https://pinboard.in/u:%s/before:%s' % (user, before)]
            self.before = before
    
        def parse(self, response):
            # fetches json representation of bookmarks instead of using css or xpath
            bookmarks = re.findall(r'bmarks\[\d+\] = (\{.*?\});', response.body.decode('utf-8'), re.DOTALL | re.MULTILINE)
    
            for b in bookmarks:
                bookmark = json.loads(b)
                yield self.parse_bookmark(bookmark)
    
            # Get bookmarks in previous pages
            previous_page = response.css('a#top_earlier::attr(href)').extract_first()
            if previous_page:
                previous_page = response.urljoin(previous_page)
                yield scrapy.Request(previous_page, callback=self.parse)
    
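For reference, the regex above assumes that the Pinboard page source embeds each bookmark as a JavaScript assignment of the form bmarks[i] = {...};. A minimal, self-contained sketch of what the extraction does (the sample markup is illustrative, not actual Pinboard output):

    import json
    import re

    # Illustrative page fragment; the real Pinboard markup may differ.
    sample = """
    bmarks[0] = {"url_slug": "abc123", "title": "Example", "author": "notiv"};
    bmarks[1] = {"url_slug": "def456", "title": "Another", "author": "notiv"};
    """

    for raw in re.findall(r'bmarks\[\d+\] = (\{.*?\});', sample, re.DOTALL):
        print(json.loads(raw)['url_slug'])   # prints abc123, then def456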

This method scrapes the bookmark's information, including the corresponding url_slug, stores it in a PinscrapyItem, and then yields a scrapy.Request to parse the url_slug:

    def parse_bookmark(self, bookmark):
        pin = PinscrapyItem()
    
        pin['url_slug'] = bookmark['url_slug']
        pin['title'] = bookmark['title']
        pin['author'] = bookmark['author']
    
        # IF I REMOVE THE FOLLOWING LINE THE PARSING OF ONE USER WORKS (STEP 1) BUT NO STEP 2 IS PERFORMED  
        yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)
    
        return pin
    

Finally, the parse_url_slug method finds other users who saved this bookmark and recursively yields a scrapy.Request to parse each of them:

    def parse_url_slug(self, response):
        url_slug = UrlSlugItem()
    
        if response.body:
            soup = BeautifulSoup(response.body, 'html.parser')
    
            users = soup.find_all("div", class_="bookmark")
            user_list = [re.findall('/u:(.*)/t:', element.a['href'], re.DOTALL) for element in users]
            user_list_flat = sum(user_list, []) # Change from list of lists to list
    
            url_slug['user_list'] = user_list_flat
    
            for user in user_list:
                yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)
    
        return url_slug
    
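A side note on the flattening step: re.findall returns a list, so user_list is a list of lists, and sum(user_list, []) concatenates them into a flat list. A quick illustration (sample values made up):

    import itertools

    user_list = [['ronert'], ['notiv'], []]              # shape returned by the re.findall calls
    print(sum(user_list, []))                            # ['ronert', 'notiv']
    # Equivalent, and faster for long lists:
    print(list(itertools.chain.from_iterable(user_list)))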

(To present the code more concisely, I removed the parts that store other interesting fields, check for duplicates, etc.)

Any help is greatly appreciated!

1 Answer:

Answer 0 (score: 0)

The problem is in the following block of your code:

yield self.parse_bookmark(bookmark)

because parse_bookmark contains these two lines:

# IF I REMOVE THE FOLLOWING LINE THE PARSING OF ONE USER WORKS (STEP 1) BUT NO STEP 2 IS PERFORMED  
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)

return pin

Since the function contains a yield, its return value is a generator. You yield that generator back to Scrapy, and Scrapy does not know what to do with it.
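This is easy to reproduce in plain Python, outside Scrapy (a minimal sketch; the strings stand in for the real Request and item objects):

    def parse_bookmark(bookmark):   # simplified stand-in for the method above
        yield 'a request'
        return 'a pin'              # in a generator, return only sets StopIteration.value

    result = parse_bookmark({})
    print(result)                   # <generator object parse_bookmark at 0x...>

This also means the trailing return pin never delivers the item to Scrapy: in a generator, the returned value is attached to the StopIteration exception and is discarded by most callers.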

The fix is simple. Change your code to

yield from self.parse_bookmark(bookmark)

This yields the values from the generator one at a time, instead of the generator itself. Alternatively, you can do the same with a loop:

for ret in self.parse_bookmark(bookmark):
    yield ret
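Both spellings delegate to the inner generator (yield from requires Python 3.3+). A standalone comparison with toy generators, no Scrapy involved:

    def inner():
        yield 1
        yield 2

    def wrong():
        yield inner()        # yields the generator object itself

    def right():
        yield from inner()   # yields 1, then 2

    print(list(wrong()))     # [<generator object inner at 0x...>]
    print(list(right()))     # [1, 2]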

EDIT-1

Change your function to yield the item first:

yield pin
yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)
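
Put together, the reordered method becomes an ordinary generator with no return statement (a sketch assembled from the code in the question):

    def parse_bookmark(self, bookmark):
        pin = PinscrapyItem()

        pin['url_slug'] = bookmark['url_slug']
        pin['title'] = bookmark['title']
        pin['author'] = bookmark['author']

        # Yield the item first so it is scraped right away,
        # then schedule the follow-up request for the url_slug page.
        yield pin
        yield scrapy.Request('https://pinboard.in/url:' + pin['url_slug'], callback=self.parse_url_slug)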

And the same in parse_url_slug:

    url_slug['user_list'] = user_list_flat
    yield url_slug
    for user in user_list:
        yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)

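And the whole callback, again assembled from the question's code as a sketch. One extra detail: the loop should iterate over user_list_flat rather than user_list, since user_list is a list of one-element lists - that is what produces the u:%5B'semanticdreamer'%5D URL in the log below:

    def parse_url_slug(self, response):
        url_slug = UrlSlugItem()

        if response.body:
            soup = BeautifulSoup(response.body, 'html.parser')

            users = soup.find_all("div", class_="bookmark")
            user_list = [re.findall('/u:(.*)/t:', element.a['href'], re.DOTALL) for element in users]
            user_list_flat = sum(user_list, [])  # flatten the list of lists

            url_slug['user_list'] = user_list_flat

            # Yield the item before scheduling the per-user requests
            yield url_slug
            for user in user_list_flat:  # note: the flat list, so each user is a plain string
                yield scrapy.Request('https://pinboard.in/u:%s/before:%s' % (user, self.before), callback=self.parse)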
Yielding the item later would first schedule many other requests, and it would take a while before you started seeing scraped items. I ran your code with the changes above and it works fine for me:

2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/u:%5B'semanticdreamer'%5D/before:3000000000>
{'url_slug': 'e1ff3a9fb18873e494ec47d806349d90fec33c66', 'title': 'Flair Conky Offers Dark & Light Version For All Linux Distributions - NoobsLab | Ubuntu/Linux News, Reviews, Tutorials, Apps', 'author': 'semanticdreamer'}
2017-08-20 14:02:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pinboard.in/url:d9c16292ec9019fdc8411e02fe4f3d6046185c58>
{'user_list': ['ronert', 'notiv']}