Web scraping while bypassing an "Important Information" page

Time: 2014-07-14 04:39:22

Tags: python r validation web-scraping web-crawler

I want to use Python or R to extract the fund prices from the following link:

http://www.mpf.invesco.com.hk/html/en/mpf/prices.html

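But every time I load the page in a browser, it redirects me to the page below, where I have to confirm that I have read the important information before I can see the fund prices.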

http://www.mpf.invesco.com.hk/html/en/mpf/information.html

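I suspect that the "Important Information" page is generated by JavaScript. Can I use R or Python to confirm that I have read the important information and then retrieve the fund prices from the page that follows?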

2 answers:

Answer 0 (score: 1)

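The situation is a bit simpler than that. The table you need sits inside an iframe that is loaded from a separate URL (the one assigned to URL in the requests example below).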
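One way to find such an iframe address without opening the browser's developer tools is to scan the outer page's HTML for iframe tags. The following is a minimal sketch using only the standard library. The sample markup and the example.com address are made up for illustration, and the approach assumes the iframe is present in the static HTML; if the page inserts it with JavaScript, it will not show up this way.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collects the src attribute of every <iframe> tag it encounters."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative addresses against the page's own URL.
                self.sources.append(urljoin(self.base_url, src))


def iframe_sources(html, base_url):
    """Return the absolute URLs of all iframes found in the given HTML."""
    finder = IframeFinder(base_url)
    finder.feed(html)
    return finder.sources


# Hypothetical markup standing in for the outer page.
SAMPLE = (
    '<html><body>'
    '<iframe id="myFrame" src="/prices/frame.do?version=en"></iframe>'
    '</body></html>'
)

sources = iframe_sources(SAMPLE, "http://example.com/html/en/prices.html")
print(sources)
```

With a real page, the HTML passed to iframe_sources would be the text of the response for the outer page.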

Here is how to fetch that iframe's content with requests and parse it with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

URL = 'https://apps.ap.invesco.com/invee/fund_info/fund_price_ns_mpf.do?version=en&haaccount=N&url=http://www.mpf.invesco.com.hk/html/pdf/factsheets/mpf'
response = requests.get(URL)

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all('table')[1]

# print the first row as an example
print(table.tr.text.strip())

Prints:

Valuation Date: 10/07/2014
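If the valuation date is needed as a date object rather than a string, the printed line can be parsed as in the sketch below. The helper name is hypothetical, and the day-first format is an assumption: this thread is dated 14 July 2014, so 10/07/2014 reads as 10 July rather than 7 October.

```python
from datetime import datetime


def parse_valuation_date(line):
    """Parse a line such as 'Valuation Date: 10/07/2014' into a date.

    Assumes the date is written day-first (dd/mm/yyyy).
    """
    label, _, value = line.partition(":")
    if label.strip() != "Valuation Date":
        raise ValueError("unexpected line: %r" % line)
    return datetime.strptime(value.strip(), "%d/%m/%Y").date()


valuation_date = parse_valuation_date("Valuation Date: 10/07/2014")
print(valuation_date.isoformat())  # 2014-07-10
```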

FYI, selenium and a real browser are not needed here.
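To go beyond the first row, every row of the table can be collected as a list of cell texts. This is a rough sketch rather than a tested scraper for the live site: the sample HTML is a hypothetical stand-in for the iframe's content (the real markup may differ), and table_rows is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the iframe's HTML; the real layout may differ.
SAMPLE_HTML = """
<table><tr><td>layout</td></tr></table>
<table>
  <tr><td>Valuation Date: 10/07/2014</td></tr>
  <tr><th>Name of Constituent Fund</th><th>Unit Class</th><th>Currency</th><th>Fund Price</th></tr>
  <tr><td>Asian Equity Fund</td><td>A</td><td>HKD</td><td>10.2323</td></tr>
  <tr><td>Growth Fund</td><td>A</td><td>HKD</td><td>19.2199</td></tr>
</table>
"""


def table_rows(html, table_index=1):
    """Return the rows of the chosen table as lists of stripped cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[table_index]
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:  # skip rows that contain no cells
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Against the live page, response.content from the requests call would be passed in place of SAMPLE_HTML.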

Answer 1 (score: 1)

Using RSelenium with phantomjs:

# use dev version so we can run phantomjs without a selenium server
# devtools::install_github("ropensci/RSelenium")
# phantomjs must be on your PATH; if it is not,
# refer to the package vignettes

library(RSelenium)
appURL <- "http://www.mpf.invesco.com.hk/html/en/mpf/prices.html"
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
# <span onclick=\"accept();return false;\">I have read the Important Information</span>
# run the same JavaScript that the span's onclick handler calls
remDr$executeScript("accept();return false;")
# switch to iframe element
remDr$switchToFrame("myFrame")

library(XML)  # provides readHTMLTable
head(readHTMLTable(remDr$getPageSource()[[1]],
                   which = 2, header = TRUE, skip.rows = 1))

                                                         Name of Constituent Fund Unit Class Currency Fund Price
1                                                 Hong Kong and China Equity Fund          A      HKD    34.5537
2                                                               Asian Equity Fund          A      HKD    10.2323
3                                                                     Growth Fund          A      HKD    19.2199
4                                                                   Balanced Fund          A      HKD    18.8244
5 RMB Bond Fund (this Constituent Fund is denominated in HKD only and not in RMB)          A      HKD     9.8299
6                                                             Capital Stable Fund          A      HKD    18.3871

Finally, close the phantomjs instance when you are done:

pJS$stop()