How do I get the links inside li tags?

Asked: 2013-09-09 09:48:10

Tags: python python-2.7 beautifulsoup

I have this code:

import urllib
from bs4 import BeautifulSoup
url = "http://download.cnet.com/windows/"
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
for a in soup.select("div.catFlyout a[href]"):
    print "http://download.cnet.com"+a["href"]

But this code does not give the correct output. The correct output should look like this:

http://download.cnet.com/windows/security-software/
http://download.cnet.com/windows/browsers/
http://download.cnet.com/windows/business-software/
..
..
http://download.cnet.com/windows/video-software/

1 Answer:

Answer 0 (score: 1)

The list contains a mix of relative and absolute links; prepend the base URL only when a link does not already start with http:

for a in soup.select("div.catFlyout a[href]"):
    if not a["href"].startswith("http"):
        # relative link: prefix it with the site root
        print "http://download.cnet.com"+a["href"]
    else:
        # already absolute: print it unchanged
        print a["href"]

Alternatively, use urlparse to check whether a link is absolute (taken from here):

import urllib
import urlparse
from bs4 import BeautifulSoup

def is_absolute(url):
    # a URL is absolute if it carries a scheme such as http or https
    return bool(urlparse.urlparse(url).scheme)

url = "http://download.cnet.com/windows/"
pageHtml = urllib.urlopen(url)
soup = BeautifulSoup(pageHtml)
for a in soup.select("div.catFlyout a[href]"):
    if not is_absolute(a['href']):
        print "http://download.cnet.com"+a["href"]
    else:
        print a["href"]
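As a side note (not part of the original answer), the standard library's urlparse.urljoin resolves relative hrefs against a base URL and leaves absolute ones unchanged, so the if/else check can be dropped entirely. A minimal sketch under the same Python 2 setup:

import urllib
import urlparse
from bs4 import BeautifulSoup

base = "http://download.cnet.com/windows/"
soup = BeautifulSoup(urllib.urlopen(base))
for a in soup.select("div.catFlyout a[href]"):
    # urljoin keeps absolute hrefs as-is and resolves relative ones
    # against the page URL
    print urlparse.urljoin(base, a["href"])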