Question

I am trying to scrape the link from the "Box Score" button on this page. The link should look like this:

http://www.espn.com/nfl/boxscore?gameId=400874795

I tried this code to see whether I could access the button, but I couldn't.

from bs4 import BeautifulSoup
import requests

url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

advanced = url
r = requests.get(advanced)
data = r.text
soup = BeautifulSoup(data,"html.parser")

for link in soup.find_all('a'):
    print(link)

Answer 1

As Wpercy mentioned in the comments, you can't do this with requests alone. As a suggestion, use selenium together with Chromedriver / PhantomJS to handle the JavaScript:

from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2"
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html,'html.parser')

boxList = soup.findAll('a',{'name':'&lpos=nfl:scoreboard:boxscore'})

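The a tags of all the box score buttons have the attribute name = &lpos=nfl:scoreboard:boxscore, so we first select them with .findAll; a simple list comprehension can then extract each href attribute: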

>>> links = [box['href'] for box in boxList]
>>> links
['/nfl/boxscore?gameId=400874795', '/nfl/boxscore?gameId=400874854', '/nfl/boxscore?gameId=400874753', '/nfl/boxscore?gameId=400874757', '/nfl/boxscore?gameId=400874772', '/nfl/boxscore?gameId=400874777', '/nfl/boxscore?gameId=400874767', '/nfl/boxscore?gameId=400874812', '/nfl/boxscore?gameId=400874761', '/nfl/boxscore?gameId=400874764', '/nfl/boxscore?gameId=400874781', '/nfl/boxscore?gameId=400874796', '/nfl/boxscore?gameId=400874750', '/nfl/boxscore?gameId=400873867', '/nfl/boxscore?gameId=400874775', '/nfl/boxscore?gameId=400874798']
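
The hrefs above are relative, while the question asks for full URLs such as http://www.espn.com/nfl/boxscore?gameId=400874795. A minimal sketch of one way to resolve them, assuming Python 3 and the standard library's urllib.parse.urljoin; the names base_url and full_links are only illustrative, and the list is a shortened copy of the output above:

```python
from urllib.parse import urljoin

# Scoreboard URL from the answer above; relative hrefs are resolved against it.
base_url = "http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2"

# First few entries of the `links` list shown above (shortened for illustration).
links = [
    "/nfl/boxscore?gameId=400874795",
    "/nfl/boxscore?gameId=400874854",
    "/nfl/boxscore?gameId=400874753",
]

# An href that starts with "/" replaces the whole path of base_url.
full_links = [urljoin(base_url, href) for href in links]

for link in full_links:
    print(link)
```

Resolving against the page URL rather than concatenating strings should also cope with hrefs that are already absolute, since urljoin returns those unchanged.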

Answer 2

Here is the solution I came up with. It scrapes all the links at the URL you provided. Could you take a look?

# from BeautifulSoup import *
from bs4 import BeautifulSoup
# import requests
import urllib
url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

# advanced = url
html = urllib.urlopen(url).read()
# r = requests.get(html)
# data = r.text
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

tags = soup('a')

# for link in soup.find_all('a'):
for i, tag in enumerate(tags):
    # print tag
    print i
    ans = tag.get('href', None)
    print ans
    print "\n"

Answer 3

Gopal Chitalia's answer did not work for me, so I decided to post a version that does (for Python 3.6.5):

# from BeautifulSoup import *
from bs4 import BeautifulSoup
# import requests
import urllib.request  # "import urllib" alone does not load the request submodule
url = 'http://www.espn.com/nfl/scoreboard/_/year/2016/seasontype/1/week/2'

# advanced = url
html = urllib.request.urlopen(url)
# urlopen(url).read()
# r = requests.get(html)
# data = r.text
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

tags = soup('a')

# for link in soup.find_all('a'):
for i, tag in enumerate(tags):
    # print(tag)
    print(i)
    ans = tag.get('href', None)
    print(ans)
    print("\n")
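
The loop above prints every href on the page. To keep only the box score links the question asks about, the hrefs could be filtered by their path. A minimal sketch, assuming the links look like /nfl/boxscore?gameId=... as in Answer 1; the helper name boxscore_links and the sample list are hypothetical. If the buttons are rendered by JavaScript, as Answer 1 says, the statically fetched page may contain no such links at all:

```python
def boxscore_links(hrefs):
    """Return only the hrefs that point at a box score page."""
    # tag.get('href', None) yields None for anchors without an href, so skip those.
    return [h for h in hrefs if h and "/nfl/boxscore?gameId=" in h]


# Hypothetical sample of values that tag.get('href', None) might return.
sample = [
    None,
    "/nfl/scoreboard",
    "/nfl/boxscore?gameId=400874795",
    "http://www.espn.com/nfl/boxscore?gameId=400874854",
]

print(boxscore_links(sample))
```

In the loop above, the same filter could be applied by collecting each ans into a list first and passing that list to the helper.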

Scraping links from a button on a page
