Extracting the "href" from an "a" tag

Time: 2019-05-31 15:53:09

Tags: python html web-scraping beautifulsoup tags

I am trying to extract the link from `<a href="link" ...>` elements.

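Because the table has multiple rows, I iterate over each of them. The first link in each row is the one I need, so I use find_all('tr') and find('a'). I know that find('a') returns a NoneType for some rows, but I don't know how to work around it.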

I have a piece of code that works but is very inefficient (shown commented out below).

import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://morocco.observation.org/soortenlijst_wg_v3.php')
soup = bs.BeautifulSoup(sauce, 'lxml')

tabel = soup.find('table', {'class': 'tablesorter'})
for i in tabel.find_all('tr'):
#     if 'view' in i.get('href'):
#         link_list.append(i.get('href'))

    link = i.find('a')
#<a class="z1" href="/soort/view/164?from=1987-12-05&amp;to=2019-05-31">Common Reed Bunting - <em>Emberiza schoeniclus</em></a>     

How can I retrieve the link in the href, work around the NoneType problem, and end up with just /soort/view/164?from=1987-12-05&to=2019-05-31?

Thanks in advance.

2 Answers:

Answer 0: (score: 0)

link = i.find('a')
_href = link['href']
print(_href)

Output:

/soort/view/164?from=1987-12-05&to=2019-05-31

This is a relative path, not a complete URL; you need to join it with the domain name:

new_url = "https://morocco.observation.org"+_href
print(new_url)

Output:

https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-05-31
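As an alternative to plain string concatenation, the standard library's `urljoin` can combine the base and the relative path. This is a minimal sketch rather than part of the original answer; the variable names are illustrative, and the second call uses a made-up absolute URL only to show how such a value is handled.

```python
from urllib.parse import urljoin

base_url = "https://morocco.observation.org"
href = "/soort/view/164?from=1987-12-05&to=2019-05-31"

# A relative href is resolved against the base URL.
full_url = urljoin(base_url, href)
print(full_url)

# An href that is already absolute is returned unchanged
# (hypothetical URL, used only for illustration).
already_absolute = urljoin(base_url, "https://example.org/page")
print(already_absolute)
```

Concatenation works here because every href on the page is expected to start with "/"; `urljoin` is simply more forgiving if that assumption does not hold.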

Update

from bs4 import BeautifulSoup
from bs4.element import Tag
import requests

resp = requests.get("https://morocco.observation.org/soortenlijst_wg_v3.php")
soup = BeautifulSoup(resp.text, 'lxml')
tabel = soup.find('table', {'class': 'tablesorter'})
base_url = "https://morocco.observation.org"

for i in tabel.find_all('tr'):
    link = i.find('a',href=True)
    if link is None or not isinstance(link,Tag):
        continue

    url = base_url + link['href']
    print(url)

Output:

https://morocco.observation.org/soort/view/248?from=1975-05-05&to=2019-06-01
https://morocco.observation.org/soort/view/174?from=1989-12-15&to=2019-06-01
https://morocco.observation.org/soort/view/57?from=1975-05-05&to=2019-06-01
https://morocco.observation.org/soort/view/19278?from=1975-05-13&to=2019-06-01
https://morocco.observation.org/soort/view/56?from=1993-03-25&to=2019-06-01
https://morocco.observation.org/soort/view/1504?from=1979-05-25&to=2019-06-01
https://morocco.observation.org/soort/view/78394?from=1975-05-09&to=2019-06-01
https://morocco.observation.org/soort/view/164?from=1987-12-05&to=2019-06-01
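The None check can be reproduced offline on a small inline table. The HTML below is invented for illustration, and it assumes that the None results come from rows (such as a header row) that contain no `<a>` tag, which has not been verified against the live page. The sketch uses the built-in `html.parser` so that it does not require lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row, one row with a link, one row without.
html = """
<table class="tablesorter">
  <tr><th>Species</th></tr>
  <tr><td><a class="z1" href="/soort/view/164">Common Reed Bunting</a></td></tr>
  <tr><td>no link in this row</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "tablesorter"})

hrefs = []
for row in table.find_all("tr"):
    # find() returns None when the row has no <a> with an href attribute.
    link = row.find("a", href=True)
    if link is None:
        continue
    hrefs.append(link["href"])

print(hrefs)
```

Only the row that actually contains a link contributes to `hrefs`; the other two rows are skipped instead of raising a TypeError on `None['href']`.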

Answer 1: (score: 0)

One logical approach is to use nth-of-type to isolate the target column:

import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://morocco.observation.org/soortenlijst_wg_v3.php')
soup = bs(r.content, 'lxml')
base = 'https://morocco.observation.org'
urls = [base + item['href'] for item in soup.select('#mytable_S td:nth-of-type(3) a')]

You can also pass a list of classes:

urls = [base + item['href'] for item in soup.select('.z1, .z2,.z3,.z4')]

Or even use the starts-with operator (^) on the class attribute:

urls = [base + item['href'] for item in soup.select('[class^=z]')]

Or use the contains operator (*) on the href attribute:

urls = [base + item['href'] for item in soup.select('[href*=view]')]

Read about the different CSS selector methods here: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
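The four selectors can be compared side by side on a small inline table. The markup below is made up for illustration and only borrows the `mytable_S` id and the `z1`/`z2` class names from the answer; the real page may be structured differently. The helper `hrefs` is a hypothetical convenience function, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: target links sit in the third column of each row.
html = """
<table id="mytable_S">
  <tr><td>1</td><td>x</td><td><a class="z1" href="/soort/view/164">A</a></td></tr>
  <tr><td>2</td><td>y</td><td><a class="z2" href="/soort/view/57">B</a></td></tr>
  <tr><td>3</td><td><a class="other" href="/about">About</a></td><td>no link</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def hrefs(selector):
    """Return the href of every element matched by a CSS selector."""
    return [a["href"] for a in soup.select(selector)]


by_column = hrefs("#mytable_S td:nth-of-type(3) a")  # links in the third cell
by_class_list = hrefs(".z1, .z2")                    # explicit list of classes
by_class_prefix = hrefs("[class^=z]")                # class starts with "z"
by_href_substring = hrefs("[href*=view]")            # href contains "view"

print(by_column, by_class_list, by_class_prefix, by_href_substring)
```

On this sample all four selectors pick the same two links and skip the `/about` link; which one is most robust on the real page depends on how stable its column order and class names are.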