Scraping links from buttons on a page

Date: 2017-08-02 16:32:14

Tags: python web-scraping beautifulsoup

I am trying to scrape the links from the "Box Score" buttons on this page. The links should look like this:

http://www.espn.com/nfl/boxscore?gameId=400874795

I tried the following code to see whether I could reach the buttons, but I could not.

from bs4 import BeautifulSoup
import requests

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

advanced = url
r = requests.get(advanced)
data = r.text
soup = BeautifulSoup(data,"html.parser")

# Print every anchor tag found in the downloaded HTML.
for link in soup.find_all('a'):
    print(link)

3 Answers:

Answer 0 (score: 1)

As Wpercy mentioned in the comments, you cannot do this with requests. As a suggestion, use selenium together with Chromedriver / PhantomJS to handle the JavaScript.

from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2"
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html,'html.parser')

boxList = soup.findAll('a',{'name':'&lpos=nfl:scoreboard:boxscore'})

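The a tag of every box score button has the attribute name = &lpos=nfl:scoreboard:boxscore, so we first collect those tags with .findAll. A simple list comprehension can then extract each href attribute: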

>>> links = [box['href'] for box in boxList]
>>> links
['/nfl/boxscore?gameId=400874795', '/nfl/boxscore?gameId=400874854', '/nfl/boxscore?gameId=400874753', '/nfl/boxscore?gameId=400874757', '/nfl/boxscore?gameId=400874772', '/nfl/boxscore?gameId=400874777', '/nfl/boxscore?gameId=400874767', '/nfl/boxscore?gameId=400874812', '/nfl/boxscore?gameId=400874761', '/nfl/boxscore?gameId=400874764', '/nfl/boxscore?gameId=400874781', '/nfl/boxscore?gameId=400874796', '/nfl/boxscore?gameId=400874750', '/nfl/boxscore?gameId=400873867', '/nfl/boxscore?gameId=400874775', '/nfl/boxscore?gameId=400874798']
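
The hrefs above are relative to the site root, while the question asks for full URLs such as http://www.espn.com/nfl/boxscore?gameId=400874795. A minimal sketch, assuming Python 3 and that the scoreboard URL from the question is the right base, resolves each one with urllib.parse.urljoin. The helper name to_absolute is hypothetical, and the two sample hrefs are copied from the output above.

```python
from urllib.parse import urljoin

# Base page the relative links were scraped from (taken from the question).
BASE_URL = "http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2"

def to_absolute(hrefs, base=BASE_URL):
    """Resolve each relative href against the base page URL."""
    return [urljoin(base, href) for href in hrefs]

# Two of the relative links from the output above.
links = ['/nfl/boxscore?gameId=400874795', '/nfl/boxscore?gameId=400874854']
full_links = to_absolute(links)
print(full_links[0])
```

Because each href starts with a slash, urljoin keeps only the scheme and host of the base URL and replaces the whole path.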

Answer 1 (score: 0)

Here is the solution I came up with. It scrapes all the links on the URL you provided. Could you take a look?

# Python 2 only: urllib.urlopen was moved to urllib.request in Python 3.
from bs4 import BeautifulSoup
import urllib

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

# Download the raw HTML of the scoreboard page.
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all('a').
tags = soup('a')

for i, tag in enumerate(tags):
    print i
    # href is None for anchors that lack the attribute.
    ans = tag.get('href', None)
    print ans
    print "\n"

Answer 2 (score: 0)

Gopal Chitalia's answer did not work for me, so I decided to post a version that does work (on Python 3.6.5).

from bs4 import BeautifulSoup
# In Python 3 the submodule must be imported explicitly;
# a bare "import urllib" does not guarantee urllib.request is loaded.
import urllib.request

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

# urlopen returns a file-like response that BeautifulSoup can read directly.
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all('a').
tags = soup('a')

for i, tag in enumerate(tags):
    print(i)
    # href is None for anchors that lack the attribute.
    ans = tag.get('href', None)
    print(ans)
    print("\n")
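
The loop above prints every href on the page, including None for anchors without one. To narrow such a list down to box score links and pull out the game IDs, something along these lines may help. This is only a sketch, assuming Python 3: the helper boxscore_game_ids and the sample hrefs are hypothetical, and it can only find hrefs that are actually present in the HTML that was fetched.

```python
from urllib.parse import urlparse, parse_qs

def boxscore_game_ids(hrefs):
    """Return the gameId of every href whose path is /nfl/boxscore."""
    game_ids = []
    for href in hrefs:
        if not href:  # skip anchors with no href (tag.get returned None)
            continue
        parsed = urlparse(href)
        if parsed.path != "/nfl/boxscore":
            continue
        # parse_qs maps each query key to a list of values.
        ids = parse_qs(parsed.query).get("gameId")
        if ids:
            game_ids.append(ids[0])
    return game_ids

# Hypothetical sample of what tag.get('href', None) might yield.
sample_hrefs = [None, "/nfl/scoreboard", "/nfl/boxscore?gameId=400874795", "#"]
print(boxscore_game_ids(sample_hrefs))
```

Parsing the query string instead of slicing the href by position keeps the helper working if extra parameters are appended after gameId.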