Getting all URLs from a website using Python

Date: 2014-06-21 13:43:31

Tags: python beautifulsoup urllib2 web-crawler

I'm learning to build web scrapers and am currently working on getting all the URLs from a site. I've been experimenting and no longer have the exact code I had before, but I have been able to get all the links. My problem is that I need to repeat the same thing over and over for every page, and I think the recursion isn't doing the right thing for the code I've written. My code is below:

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    page = urllib2.urlopen( url ).read()
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        length = len(urlList)

        for url in urlList:
            getAllUrl(url)

        return urlList
    except urllib2.HTTPError, e:
        print e

if __name__ == "__main__":
    urls = getAllUrl('http://bobthemac.com')
    for x in urls:
        print x

What I want to achieve is to get all the URLs of the site it is currently set up for, but the program just runs until it exhausts memory; all I want is the URLs from the site. Does anyone have an idea of how to do this? I think I have the right idea, I just need some small changes to the code.

Edit

For those of you who are interested, below is my working code that gets all the URLs from a site; someone might find it useful. It's not the best code and does need some work, but with a bit of effort it could be quite good.

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    # Collect every link found on the given page
    urlList = []
    try:
        page = urllib2.urlopen( url ).read()
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        return urlList

    except urllib2.HTTPError, e:
        urlList.append( e )

if __name__ == "__main__":
    # Links on the front page
    urls = getAllUrl('http://bobthemac.com')

    fullList = []

    # Follow each front-page link one level deeper and de-duplicate the results
    for x in urls:
        listUrls = getAllUrl(x)
        try:
            for i in listUrls:
                if not i in fullList:
                    fullList.append(i)
        except TypeError, e:
            print 'Woops wrong content passed'

    for i in fullList:
        print i

2 Answers:

Answer 0 (score: 1)

In your function getAllUrl, you call getAllUrl again inside the for loop, which creates the recursion.

Elements are never removed from urlList once they have been added, so urlList is never empty and the recursion never bottoms out.

That is why your program never finishes and eventually runs out of memory.
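A minimal sketch of that fix, assuming the same urllib2/BeautifulSoup setup as the question: crawl iteratively from a work queue and keep a set of URLs that have already been visited, so each page is fetched at most once. The crawlSite function name and the same-host check are illustrative additions, not part of the original code.

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def crawlSite(startUrl):
    visited = set()        # pages already fetched -- this is what stops the endless looping
    queue = [startUrl]     # work queue of URLs still to fetch
    host = urlparse.urlparse(startUrl).netloc
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            page = urllib2.urlopen(url).read()
        except urllib2.URLError, e:
            print e
            continue
        soup = BeautifulSoup(page)
        for anchor in soup.findAll('a', href=True):
            # Resolve relative links against the page they were found on
            link = urlparse.urljoin(url, anchor['href'])
            # Only queue links that stay on the same site
            if urlparse.urlparse(link).netloc == host and link not in visited:
                queue.append(link)
    return visited

if __name__ == "__main__":
    for u in crawlSite('http://bobthemac.com'):
        print u

The visited set is what gives the crawl a stopping condition: once every discovered URL on the site has been fetched once, the queue empties and the loop ends.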

Answer 1 (score: 1)

I think this works:

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    try:
        page = urllib2.urlopen( url ).read()
    except:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin(url, anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin(url, anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        length = len(urlList)

        return urlList
    except urllib2.HTTPError, e:
        print e

def listAllUrl(urls):
    for x in urls:
        print x
        urls.remove(x)
        urls_tmp = getAllUrl(x)
        for y in urls_tmp:
            urls.append(y)


if __name__ == "__main__":
    urls = ['http://bobthemac.com']
    while len(urls) > 0:
        listAllUrl(urls)