How to get href links from an href using python / pandas

Time: 2018-11-13 11:35:34

Tags: python pandas beautifulsoup python-requests

I need to get the href links that live on the page behind an href I already have, so I need to follow that href and collect the further hrefs found there. I tried, but my code only picks up the first level of hrefs; I want to follow each of those and collect the hrefs on the pages behind them. How should I do this? Here is what I tried:

from bs4 import BeautifulSoup
import requests
url = 'https://www.iea.org/oilmarketreport/reports/'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
#soup.prettify()
#table = soup.find("table")
#print(table)
links = []
for href in soup.find_all(class_='omrlist'):
    #print(href)
    links.append(href.find('a').get('href'))
print(links) 
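
Note that the hrefs collected this way are relative paths. As a minimal sketch (not part of the original question; the relative href shown is hypothetical), the standard library's urljoin can resolve them against the page URL:

from urllib.parse import urljoin

page_url = 'https://www.iea.org/oilmarketreport/reports/'
relative_href = '../../../media/omrreports/example.pdf'  # hypothetical href for illustration
print(urljoin(page_url, relative_href))  # https://www.iea.org/media/omrreports/example.pdf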

1 Answer:

Answer 0 (score: 1):

Here is how to loop through the year pages to collect the report URLs:

import requests
from bs4 import BeautifulSoup

root_url = 'https://www.iea.org'

def getLinks(url):
    all_links = []
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all(class_='omrlist'):
        all_links.append(root_url + href.find('a').get('href'))  # prepend the site root to make the link absolute
    return all_links

yearLinks = getLinks(root_url + '/oilmarketreport/reports/')

# collect the report URLs from each year page
reportLinks = []
for url in yearLinks:
    links = getLinks(url)
    reportLinks.extend(links)

print(reportLinks)
for url in reportLinks:
    if '.pdf' in url:
        url = url.replace('../../..', '')  # strip the relative prefix left in the scraped href
        # download the pdf file here (see the download sketch below)
        pass
    else:
        # extract the pdf url from the report's html page, then download it (see the sketch below)
        pass
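
For the else branch, here is a minimal sketch of pulling the pdf link out of a report page; it assumes each report page contains an <a> tag whose href ends in '.pdf', which you should verify against the actual page markup:

from bs4 import BeautifulSoup
import requests

def extract_pdf_url(report_url):
    # fetch the report page and return the first anchor pointing at a .pdf file
    page = requests.get(report_url)
    soup = BeautifulSoup(page.text, 'html.parser')
    link = soup.find('a', href=lambda h: h and h.endswith('.pdf'))  # assumed selector
    return link.get('href') if link else None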

Now you can loop over reportLinks to get the final pdf URLs and download them.
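
A minimal download sketch using requests' streaming mode (the output directory and the filename scheme, taken from the last path segment, are assumptions):

import os
import requests

def download_pdf(pdf_url, out_dir='omr_reports'):
    # stream the response to disk, naming the file after the last path segment
    os.makedirs(out_dir, exist_ok=True)
    filename = os.path.join(out_dir, pdf_url.rsplit('/', 1)[-1])
    resp = requests.get(pdf_url, stream=True)
    resp.raise_for_status()
    with open(filename, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return filename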