Web scraping an AngularJS website

Time: 2017-06-06 22:48:55

Tags: python selenium web-scraping beautifulsoup

I am trying to scrape an AngularJS website using BeautifulSoup. The site's content is generated entirely by JavaScript.

The site is: https://sports.bovada.lv/baseball/mlb/pitcher-props-market-group

I thought I could use the PhantomJS webdriver approach. This is what I have:

import time
from selenium import webdriver

PHANTOMJS_PATH = './phantomjs.exe'
bovadaURL = 'https://sports.bovada.lv/baseball/mlb/pitcher-props-market-group'
driver = webdriver.PhantomJS(PHANTOMJS_PATH)
driver.get(bovadaURL)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(15)  # wait for the page to load
# now print the response
print(driver.page_source)

However, this does not give the desired output. It prints:

<html><head></head><body></body></html>

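Any ideas on where to go from here? I've run out of ideas.

1 answer:

Answer 0: (score: 0)

Have you tried requests? I just tried a quick and dirty script, and it returned more than just the <html><head><body> tags: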

#!/usr/bin/python3

import requests, bs4

res = requests.get('https://sports.bovada.lv/baseball/mlb/pitcher-props-market-group')
soup = bs4.BeautifulSoup(res.text,'html.parser')

print(res.text)

I inserted a print statement to test the response, and this is the output:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"... more
<!--[if IE]><![endif]-->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="x-dns-prefetch-control" content="on">... more
<meta name="theme-color" content="#ffffff">
<link href="https://sports.bovada.lv/base... more
etc... much longer html stuff

bs4 also seems to handle it fine. If I quickly do something like finding all the links (they use <link> tags):

#!/usr/bin/python3

import requests, bs4

res = requests.get('https://sports.bovada.lv/baseball/mlb/pitcher-props-market-group')

soup = bs4.BeautifulSoup(res.text,'html.parser')
links = soup.find_all('link')

for link in links:
    print(link.attrs['href'])

This produces the following output:

>python test.py
//cdn13-a.imagestore.lv
//cdn13-b.imagestore.lv
//cdn13-c.imagestore.lv
https://cdn13-a.imagestore.lv/sites/site10/themes/websites_bovada_theme/favicon.ico
https://cdn13-a.imagestore.lv/static/site10/favicons/apple-icon-57x57.png
https://cdn13-b.imagestore.lv/static/site10/favicons/apple-icon-60x60.png
https://cdn13-b.imagestore.lv/static/site10/favicons/apple-icon-72x72.png
https://cdn13-c.imagestore.lv/static/site10/favicons/apple-icon-76x76.png
https://cdn13-b.imagestore.lv/static/site10/favicons/apple-icon-114x114.png
etc...
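
The loop above would raise a KeyError on any <link> tag without an href, and the first few results are protocol-relative (they start with //). A minimal sketch of a more defensive version, assuming you want absolute URLs; the function name resolve_link_hrefs and the base_url parameter are made up for illustration and are not part of the original script:

```python
from urllib.parse import urljoin

import bs4


def resolve_link_hrefs(html, base_url):
    """Return absolute URLs for every <link> tag that carries an href."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    hrefs = []
    for link in soup.find_all("link"):
        href = link.get("href")  # None instead of KeyError when href is absent
        if href:
            # urljoin fills in the scheme for protocol-relative values ("//host")
            hrefs.append(urljoin(base_url, href))
    return hrefs
```

With the page above you would pass res.text as html and the page URL as base_url.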

Does that help?

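Edit: It looks like a headless browser will not work here, so you will need to use a regular browser such as Firefox. But first (if you haven't already) you need to get geckodriver, which you can find here: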

https://github.com/mozilla/geckodriver/releases

You need to add it to your PATH. Once that is done, you should be able to run selenium against the site and then pass the result through bs4 as usual.
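
A rough sketch of what that could look like, assuming Selenium's Python bindings and Firefox are installed and geckodriver is on PATH. The helper names fetch_rendered_html and parse_hrefs, the 30-second timeout, and the "body has at least one child element" wait condition are all assumptions made for illustration; a selector tied to the data you actually want would be a better wait condition:

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=30):
    """Load url in Firefox and return the HTML after JavaScript has run."""
    # Imported here so parse_hrefs() below stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # expects geckodriver to be reachable via PATH
    try:
        driver.get(url)
        # Assumption: the page counts as rendered once <body> has any child
        # element. A selector specific to the data you want would be safer.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, "body *")
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on a timeout


def parse_hrefs(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

Calling parse_hrefs(fetch_rendered_html(bovadaURL)) would then open a Firefox window, wait for the page to render, and hand the resulting HTML to bs4. An explicit wait like this replaces the fixed time.sleep(15) from the question, so the script continues as soon as the condition is met.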
