Browse and scrape a website recursively

Date: 2013-08-14 00:38:28

Tags: python recursion scrapy

How can the following scraper, built with the Scrapy Python library, be made to browse an entire website recursively:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ul[@class="directory-url"]/li/a/text()').extract()
        for t in titles:
            print "Title: ", t

I tried this on a single page:

start_urls = [
    "http://www.dmoz.org/Society/Philosophy/Academic_Departments/Africa/"
]

It works fine but only returns results for the start URL and does not follow links within the domain. I assume this has to be done manually with Scrapy, but I don't know how.
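For context, "following links within the domain" amounts to a breadth-first traversal: extract the links from each fetched page, keep only those on the allowed domain, and queue the unseen ones. A minimal standard-library sketch of that idea (independent of Scrapy; the helper names `extract_links` and `crawl` are hypothetical, not part of any library):

```python
# Sketch of recursive site crawling with the standard library only.
# extract_links and crawl are illustrative helpers, not a real API.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url, allowed_domain):
    """Return absolute links in html that stay on allowed_domain."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return [u for u in parser.links
            if urlparse(u).netloc.endswith(allowed_domain)]

def crawl(start_url, fetch, allowed_domain, max_pages=100):
    """Breadth-first crawl; fetch(url) -> html is injected so the
    traversal logic can be exercised without network access."""
    seen, queue, pages = set(), [start_url], []
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        pages.append(url)
        queue.extend(extract_links(fetch(url), url, allowed_domain))
    return pages
```

Scrapy's `CrawlSpider` handles the queueing, deduplication, and domain filtering for you, which is what the answer below relies on.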

1 Answer:

Answer 0 (score: 2):

Try using a CrawlSpider (see the documentation) with a single Rule() whose LinkExtractor filters for just the domain you want:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/"
    ]

    rules = (
        Rule(
            SgmlLinkExtractor(allow_domains=("dmoz.org",)),
            callback='parse_page', follow=True
        ),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ul[@class="directory-url"]/li/a/text()').extract()
        for t in titles:
            print "Title: ", t

The callback must not be named parse (see this warning in the documentation).
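The reason for that warning: `CrawlSpider` implements its rule-following logic inside its own `parse()` method, so a subclass that overrides `parse()` silently disables link following. A toy illustration of the override problem (not real Scrapy code; the class names are made up):

```python
# Toy model of why a CrawlSpider subclass must not override parse():
# the base class uses parse() internally to apply its Rules before
# dispatching to the user-named callback.

class ToyCrawlSpider:
    def parse(self, response):
        # Base-class parse() follows the rules, then hands the page
        # to the user callback (parse_page here).
        return self.follow_rules(response) + self.parse_page(response)

    def follow_rules(self, response):
        return ["FOLLOW:" + url for url in response["links"]]

    def parse_page(self, response):
        return []

class GoodSpider(ToyCrawlSpider):
    def parse_page(self, response):   # custom callback; rules still run
        return ["ITEM:" + t for t in response["titles"]]

class BrokenSpider(ToyCrawlSpider):
    def parse(self, response):        # shadows base parse(): rules never run
        return ["ITEM:" + t for t in response["titles"]]

page = {"links": ["http://www.dmoz.org/a"], "titles": ["A"]}
print(GoodSpider().parse(page))    # follows the link AND extracts the item
print(BrokenSpider().parse(page))  # extracts the item but crawling stops
```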