Question

Is there a way to tell scrapy to stop crawling based on a condition found in a second-level page? Here is what I am doing:

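I have a start_url to begin with (the first-level page)
I extract a set of URLs from the start_url using parse(self, response)
I then queue those links using Request with parseDetailPage(self, response) as the callback
In parseDetailPage (the second-level page) I find out whether I can stop crawling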

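Right now I am using CloseSpider() to accomplish this, but the problem is that by the time I start crawling the second-level pages, the URLs to be parsed are already queued, and I do not know how to remove them from the queue. Is there a way to crawl the list of links sequentially and then be able to stop in parseDetailPage?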

# Module-level imports (old Scrapy API, as used in the rest of this fragment)
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# Spider class body; WoodPeckerItem, WPUtil and self.end_url are defined elsewhere
start_urls = ["http://sfbay.craigslist.org/sof/"]

def __init__(self):
    # Set to False once a detail page tells us to stop
    self.job_in_range = True

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    requests = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            nextUrl = link.extract()
            # Compare the extracted URL string by value, not by identity
            if nextUrl == self.end_url:
                break
            isValid = WPUtil.validateUrl(nextUrl)
            if isValid:
                item = WoodPeckerItem()
                item['url'] = nextUrl
                # All detail-page requests are queued here, before any of them is parsed
                request = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                requests.append(request)
    else:
        self.error.log('Could not parse the document')
    return requests

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    hxs = HtmlXPathSelector(response)
    print(response)
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].select('.//date/text()')[0].extract()
    # Extract the title before testing it; compare strings with ==, not "is"
    item['jobTitle'] = body.select('.//h2[@class="postingtitle"]/text()')[0].extract()
    if item['jobTitle'] == 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    item['description'] = body.select('.//section[@class="userbody"]/section[@id="postingbody"]').extract()
    return item

Answer 1

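Do you mean that you want to stop the spider and later resume it without re-parsing the URLs that have already been parsed? If so, you can try setting the JOBDIR setting. It persists the pending request queue to a directory you specify on disk.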
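
As a minimal sketch, assuming a standard Scrapy project with a settings.py file, enabling this might look like the following. The directory name crawls/woodpecker-1 is a hypothetical example and does not come from the question.

```python
# settings.py (project-level Scrapy settings)

# Directory where Scrapy keeps the state of a crawl job, including the
# pending request queue. The path below is a made-up example; the usual
# advice is to use a separate directory for each spider and each job.
JOBDIR = "crawls/woodpecker-1"
```

The same value can also be passed for a single run on the command line, for example: scrapy crawl <spider_name> -s JOBDIR=crawls/woodpecker-1. Stopping the spider gracefully (a single Ctrl-C) and running the same command again should then resume from the saved queue instead of starting over.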
