Extracting a table from a web page in R

Date: 2015-01-19 18:37:43

Tags: r web-scraping

I want to extract the "Match by match list" table from
http://stats.espncricinfo.com/ci/engine/player/50710.html?class=2;template=results;type=batting;view=match

I'm new to R, so I don't know much about extracting data from web pages. I used this code to extract the table.

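library(XML)  # provides htmlTreeParse(), xpathSApply() and xmlValue()

fileUrl <- "http://stats.espncricinfo.com/ci/engine/player/50710.html?class=2;template=results;type=batting;view=match"
# load and parse the page into an internal document that XPath can query
sanga <- htmlTreeParse(fileUrl, useInternalNodes = TRUE)
# xmlValue() collapses each matching <tr> into one string
sanga.data <- xpathSApply(sanga, "//tr[@class='data1']", xmlValue)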

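But I end up with a single-column matrix in which each column of the original table is represented as a row. I read the information in this thread, but I still can't figure out how to get the data in tabular form: Scraping html tables into R data frames using the XML package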

1 answer:

Answer 0 (score: 0)

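You will need to do some work on the column names (and drop the all-NA 'spacer' column), but with the right XPath you can go straight to the table you want: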

library(rvest)
library(magrittr)

# read_html() replaces html(), which was deprecated and later removed from rvest
pg <- read_html("http://stats.espncricinfo.com/ci/engine/player/50710.html?class=2;template=results;type=batting;view=match")

pg %>% 
  html_nodes(xpath="//tr[@class='data1']/../..") %>%  # get to a reasonable set of tables (there are many)
  extract2(2) %>%                                     # we want the second one
  html_table(header=TRUE, trim=TRUE) -> data          # there's a header and pls trim the blanks
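To see why the XPath ends in `/../..`, here is a minimal offline sketch. The HTML string is invented for illustration and is not the real page; it only assumes, as the pipeline above implies, that each `data1` row sits inside a `<tbody>` inside a `<table>`. Stepping up twice from every matching row then lands on the enclosing tables, and each table is returned once even if it holds several matching rows.

```r
library(rvest)

# Made-up two-table page; only the nesting <table> > <tbody> > <tr class='data1'> matters.
toy_html <- "
<html><body>
<table>
  <thead><tr><th>Span</th></tr></thead>
  <tbody><tr class='data1'><td>2000-2015</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Runs</th><th>Ground</th></tr></thead>
  <tbody>
    <tr class='data1'><td>35</td><td>Galle</td></tr>
    <tr class='data1'><td>85</td><td>Galle</td></tr>
  </tbody>
</table>
</body></html>"

toy <- read_html(toy_html)

# tr -> tbody -> table: one node per enclosing table, in document order
tables <- html_nodes(toy, xpath = "//tr[@class='data1']/../..")

# Pick the second table by position, as extract2(2) does in the pipeline
second <- html_table(tables[[2]], header = TRUE)
```

If a page omits `<tbody>`, the same expression would climb one level too far, so it is worth checking `html_name(tables)` before indexing into the result.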

str(data)
## 'data.frame':  397 obs. of  11 variables:
##  $ Bat1      : chr  "35" "85" "36*" "DNB" ...
##  $ Runs      : chr  "35" "85" "36" "-" ...
##  $ BF        : chr  "55" "116" "47" "-" ...
##  $ SR        : chr  "63.63" "73.27" "76.59" "-" ...
##  $ 4s        : chr  "4" "11" "3" "-" ...
##  $ 6s        : chr  "0" "0" "0" "-" ...
##  $           : logi  NA NA NA NA NA NA ...
##  $ Opposition: chr  "v Pakistan" "v South Africa" "v Pakistan" "v South Africa" ...
##  $ Ground    : chr  "Galle" "Galle" "Colombo (RPS)" "Colombo (SSC)" ...
##  $ Start Date: chr  "5 Jul 2000" "6 Jul 2000" "9 Jul 2000" "11 Jul 2000" ...
##  $           : chr  "ODI # 1603" "ODI # 1604" "ODI # 1608" "ODI # 1610" ...
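The column-name cleanup mentioned above could look something like the following sketch. To keep it runnable without network access, `data` is rebuilt here from the first four rows of the `str()` output. The replacement names (`Fours`, `Sixes`, `StartDate`, `MatchId`) are my own choices, not anything the site provides, and the code assumes that "-" is the only non-numeric placeholder in the numeric columns.

```r
# Stand-in for the scraped `data`, copied from the first rows of the str() output.
data <- data.frame(
  c("35", "85", "36*", "DNB"),
  c("35", "85", "36", "-"),
  c("55", "116", "47", "-"),
  c("63.63", "73.27", "76.59", "-"),
  c("4", "11", "3", "-"),
  c("0", "0", "0", "-"),
  c(NA, NA, NA, NA),
  c("v Pakistan", "v South Africa", "v Pakistan", "v South Africa"),
  c("Galle", "Galle", "Colombo (RPS)", "Colombo (SSC)"),
  c("5 Jul 2000", "6 Jul 2000", "9 Jul 2000", "11 Jul 2000"),
  c("ODI # 1603", "ODI # 1604", "ODI # 1608", "ODI # 1610"),
  stringsAsFactors = FALSE
)
names(data) <- c("Bat1", "Runs", "BF", "SR", "4s", "6s", "",
                 "Opposition", "Ground", "Start Date", "")

# Drop every column that contains nothing but NA (the 'spacer' column).
all_na <- vapply(data, function(col) all(is.na(col)), logical(1))
clean <- data[, !all_na, drop = FALSE]

# Assign syntactic names by position (hypothetical names, chosen for this example).
stopifnot(ncol(clean) == 10)
names(clean) <- c("Bat1", "Runs", "BF", "SR", "Fours", "Sixes",
                  "Opposition", "Ground", "StartDate", "MatchId")

# Turn the "-" placeholders into NA, then convert the numeric columns.
num_cols <- c("Runs", "BF", "SR", "Fours", "Sixes")
clean[num_cols] <- lapply(clean[num_cols], function(col) {
  col[col == "-"] <- NA
  as.numeric(col)
})

# Parse the dates; %b matches month abbreviations in the current locale,
# so this assumes an English locale.
clean$StartDate <- as.Date(clean$StartDate, format = "%d %b %Y")
```

`Bat1` is left as character on purpose, since it mixes scores with markers such as "36*" and "DNB" that would be lost in a numeric conversion.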