How do I recursively find all links on a webpage with BeautifulSoup?

Asked: 2017-10-08 09:41:38

Tags: python recursion beautifulsoup

I have been trying to use the code I found in this answer to recursively find all links at a given URL:

import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url,depth):

    if depth == 5:
        return url
    else:
        page=urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a') #find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink,depth+1)


def getLinks(url):
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link,0))
    return links

links = getLinks(url)
print(links)

Apart from this warning:

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")
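
(As an aside, applying that fix here just means naming the parser explicitly; a minimal sketch, assuming lxml is installed, with the stdlib "html.parser" as a fallback:)

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://francaisauthentique.libsyn.com/")
# Naming the parser explicitly pins behaviour across machines and
# silences the UserWarning
soup = BeautifulSoup(page.read(), "lxml")  # or "html.parser" from the stdlib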

I also get the following error:

Traceback (most recent call last):
  File "downloader.py", line 28, in <module>
    links = getLinks(url)
  File "downloader.py", line 25, in getLinks
    links.append(recursiveUrl(link,0))
  File "downloader.py", line 11, in recursiveUrl
    page=urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
TypeError: 'NoneType' object is not callable

What is the problem?

2 answers:

Answer 0 (score: 0)

Your recursiveUrl is being handed a whole <a> Tag object rather than a URL string, which is what makes urllib2 fail with 'NoneType' object is not callable, and the href values involved are relative links such as /webpage/category/general extracted from the page's anchors, not complete URLs.

You should append the extracted href value to the site's base URL and then try to open the page. You will also need to rework the recursion logic yourself, since I don't know exactly what you are trying to achieve.

Code:

import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    if depth == 5:
        return url
    else:
        print(link['href'])
        page = requests.get(url + link['href'])
        soup = BeautifulSoup(page.text, 'html.parser')
        newlink = soup.find('a')
        # find() returns None when the page has no <a> tag at all,
        # so test for None instead of calling len() on the result
        if newlink is None:
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    # Collect results in a separate list: appending to `links` while
    # iterating over it would make the loop consume its own output
    results = []
    for link in links:
        results.append(recursiveUrl(url, link, 0))
    return results

links = getLinks("http://francaisauthentique.libsyn.com/")
print(links)

Output:

http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/10
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/09
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/08
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/07
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
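
A side note: every URL printed above contains a doubled slash, because the base URL already ends in / and each href also starts with /. The standard library's urljoin resolves relative hrefs against a base URL and avoids this; a minimal sketch (urlparse.urljoin on Python 2, urllib.parse.urljoin on Python 3):

from urllib.parse import urljoin  # Python 3; `from urlparse import urljoin` on Python 2

base = "http://francaisauthentique.libsyn.com/"
# urljoin collapses the duplicate slash that plain concatenation produces
print(urljoin(base, "/webpage/category/general"))
# -> http://francaisauthentique.libsyn.com/webpage/category/general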

Answer 1 (score: 0)

This code goes to every link recursively and keeps appending the full URLs to a list. The final output will be a collection of URLs.

import requests
from bs4 import BeautifulSoup

listUrl = []

def recursiveUrl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    listUrl.append(url)
    print(url)
    # find_all() returns an empty list (never None) when nothing matches,
    # so an empty result marks a dead end and stops the recursion
    if not links:
        return
    for link in links:
        # Assumes every href starts with '/': the leading slash is
        # stripped before concatenating with the base URL
        recursiveUrl(url + link['href'][1:])


recursiveUrl('http://target.com')
print(listUrl)
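
As written, this answer will recurse forever on any site whose pages link back to one another, and url + link['href'][1:] only handles root-relative hrefs. A hardened sketch under those caveats; the names crawl, visited and max_depth are mine, not from the answers above, and it still makes no attempt to stay within a single domain:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

visited = set()

def crawl(url, depth=0, max_depth=5):
    # Skip already-seen pages to break cycles, and cap the depth
    if url in visited or depth > max_depth:
        return
    visited.add(url)
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # href=True skips <a> tags that have no href attribute
    for link in soup.find_all('a', href=True):
        # urljoin handles absolute, root-relative and page-relative hrefs
        crawl(urljoin(url, link['href']), depth + 1, max_depth)

crawl('http://francaisauthentique.libsyn.com/')
print(sorted(visited))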