I have been trying to use the code I found in this answer to recursively find all of the links at a given URL:
import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink, depth + 1)

def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link, 0))
    return links

links = getLinks(url)
print(links)
Apart from the warning:
/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "lxml")
I get the following error:
Traceback (most recent call last):
  File "downloader.py", line 28, in <module>
    links = getLinks(url)
  File "downloader.py", line 25, in getLinks
    links.append(recursiveUrl(link,0))
  File "downloader.py", line 11, in recursiveUrl
    page=urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
TypeError: 'NoneType' object is not callable
What is going wrong?
Answer 0 (score: 0)
Your recursiveUrl tries to open an invalid URL such as /webpage/category/general, which is the raw value extracted from one of the href attributes.
You should append the extracted href value to the site's base URL and then try to open the page. You will also need to work on the recursion algorithm, since I don't know what you are trying to achieve.
Code:
import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    if depth == 5:
        return url
    else:
        print(link['href'])
        page = requests.get(url + link['href'])
        soup = BeautifulSoup(page.text, 'html.parser')
        newlink = soup.find('a')
        if len(newlink) == 0:
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(url, link, 0))
    return links

links = getLinks("http://francaisauthentique.libsyn.com/")
print(links)
Output:
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/10
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/09
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/08
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/2017/07
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
http://francaisauthentique.libsyn.com//webpage/category/general
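A side note on the double slashes visible in this output: they come from concatenating url + link['href'] directly. Below is a minimal Python 3 sketch (my own illustration, not part of the answer; the helper name first_link and the depth cap are assumptions) that resolves relative links with urllib.parse.urljoin instead:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def first_link(url, depth=0, max_depth=5):
    # Follow only the first <a> on each page, mirroring the answer above.
    if depth == max_depth:
        return url
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    anchor = soup.find('a')
    if anchor is None or not anchor.get('href'):
        return url
    # urljoin resolves the href against the current page URL and
    # normalizes the path, avoiding the '//webpage' artifacts above.
    return first_link(urljoin(url, anchor['href']), depth + 1, max_depth)

print(first_link("http://francaisauthentique.libsyn.com/"))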
Answer 1 (score: 0)
This code recursively visits every link and keeps appending the full URLs to a list. The final output is a collection of URLs.
import requests
from bs4 import BeautifulSoup

listUrl = []

def recursiveUrl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('a')
    if links is None or len(links) == 0:
        listUrl.append(url)
        print(url)
        return 1
    else:
        listUrl.append(url)
        print(url)
        for link in links:
            #print(url+link['href'][1:])
            recursiveUrl(url + link['href'][1:])

recursiveUrl('http://target.com')
print(listUrl)
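One caveat, as an aside rather than part of the answer: on a site whose pages link back to one another, this recursion never terminates and will eventually hit Python's recursion limit. A minimal sketch (the names crawl and visited, and the depth cap, are my own assumptions) that guards against revisits:

import requests
from bs4 import BeautifulSoup

visited = set()

def crawl(url, depth=0, max_depth=5):
    # Stop on URLs already seen or past a fixed depth, so cyclic
    # links between pages cannot cause unbounded recursion.
    if url in visited or depth == max_depth:
        return
    visited.add(url)
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:
            # Same href-joining convention as the answer above.
            crawl(url + href[1:], depth + 1, max_depth)

crawl('http://target.com')
print(visited)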