How do I make this recursive crawl function iterative?

Date: 2009-03-29 09:13:30

Tags: python recursion web-crawler

For the sake of both learning and performance: given this recursive web-crawling function (which only crawls within the given domain), what would be the best approach to make it run iteratively? Currently, by the time it finishes running, Python has climbed to over 1GB of memory usage, which is not acceptable in a shared hosting environment.

def crawl(self, url):
    "Get all URLS from which to scrape categories."
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    self.crawl(attr[1])

4 Answers:

Answer 0 (score: 12)

Use BFS (breadth-first search) instead of recursive crawling (DFS): http://en.wikipedia.org/wiki/Breadth_first_search

You can use an external storage solution (such as a database) for the BFS queue to free up RAM.

The algorithm is:

//pseudocode:
var urlsToVisit = new Queue(); // Could be a queue (BFS) or stack(DFS). (probably with a database backing or something).
var visitedUrls = new Set(); // Set of already-visited URLs.

// initialization:
urlsToVisit.Add( rootUrl );

while(urlsToVisit.Count > 0) {
  var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
  var page = FetchPage(nextUrl);
  ProcessPage(page);
  visitedUrls.Add(nextUrl);
  var links = ParseLinks(page);
  foreach (var link in links)
     if (!visitedUrls.Contains(link))
        urlsToVisit.Add(link); 
}
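
As a rough Python version of the pseudocode above: the sketch below uses an in-memory collections.deque (which could be swapped for the database-backed queue suggested earlier), and fetch_page, process_page and parse_links are hypothetical helpers standing in for whatever fetching and parsing code the crawler already has.

from collections import deque

def crawl_bfs(root_url):
    urls_to_visit = deque([root_url])   # FIFO queue -> breadth-first order
    visited = set()                     # URLs already fetched
    while urls_to_visit:
        url = urls_to_visit.popleft()   # popleft() gives BFS; pop() would give DFS
        if url in visited:
            continue                    # a URL can be queued twice before it is visited
        page = fetch_page(url)          # hypothetical helper
        process_page(page)              # hypothetical helper
        visited.add(url)
        for link in parse_links(page):  # hypothetical helper
            if link not in visited:
                urls_to_visit.append(link)

Memory usage now depends only on the size of the queue and the visited set, not on recursion depth.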

Answer 1 (score: 5)

Instead of recursing, push newly scraped URLs onto a queue, then loop until the queue is empty, with no recursion at all. If you keep the queue in a file, this uses almost no memory.
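
A minimal sketch of the file-backed queue idea, assuming a hypothetical get_links(url) helper that returns a page's outgoing links; pending URLs live in a temporary file on disk, so only the de-duplication set stays in memory:

import tempfile

def crawl_with_file_queue(root_url, get_links):
    seen = set([root_url])                  # de-duplication still costs some memory
    with tempfile.TemporaryFile(mode='w+') as queue:
        queue.write(root_url + '\n')
        read_pos = 0                        # start of the next unread URL in the file
        while True:
            queue.seek(read_pos)
            line = queue.readline()
            if not line:                    # queue drained
                break
            read_pos = queue.tell()
            url = line.strip()
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.seek(0, 2)        # jump to end of file to append
                    queue.write(link + '\n')

In practice a database table (as the other answer suggests) is a more robust backing store, but the principle is the same: the queue grows on disk rather than on the Python call stack or in RAM.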

Answer 2 (score: 2)

@Mehrdad - Thanks for your reply; the example you provided was concise and easy to understand.

The solution:

  def crawl(self, url):
    urls = Queue(-1)   # from Queue import Queue (queue.Queue in Python 3); -1 = unbounded
    _crawled = set()   # URLs already queued, so nothing is enqueued twice

    urls.put(url)
    _crawled.add(url)

    while not urls.empty():
      url = urls.get()
      try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
      except urllib2.HTTPError:
        continue
      for link in links:
        for attr in link.attrs:
          if Crawler._match_attr(attr):
            if Crawler._is_category(attr):
              continue
            else:
              Crawler._visit(attr[1])
              if attr[1] not in _crawled:
                _crawled.add(attr[1])
                urls.put(attr[1])

Answer 3 (score: 0)

This is fairly easy to do by simply using links as a queue:

def get_links(url):
    "Extract all matching links from a url"
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return []
    return links

def crawl(self, url):
    "Get all URLS from which to scrape categories."
    links = get_links(url)
    while len(links) > 0:
        link = links.pop()
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    # prepend the new links to the queue
                    links = get_links(attr[1]) + links

Of course, this doesn't solve the memory problem...