Change a dropdown menu, then scrape the data with rvest or httr

Asked: 2018-09-15 22:27:58

Tags: r

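I am scraping data from https://rotogrinders.com/lineups/nfl?site=draftkings. At the moment I pull the page in with myData <- read_html("https://rotogrinders.com/lineups/nfl?site=draftkings") and then use html_nodes to extract the data I want. I am trying to change the slate selection menu and then fetch the data for the chosen slate. The XPath of the menu I want to change is //select[@name='slate_name']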

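My research leads me to believe I need one of the following, but I am not sure how to go about it, because the menu is not part of a form and has no submit button. The page simply reloads automatically once a new option is selected: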

httr::POST
rvest::html_session
RSelenium

I am not familiar with the RSelenium library, so ideally I am looking for a solution that uses httr or rvest.

1 answer:

Answer 0 (score: 6)

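read_html() already gives you all of the information. The slate-name dropdown only filters the schedule on the client side with JavaScript. I suggest fetching all of the data and filtering it yourself. Hope that helps.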

library(magrittr)
library(rvest)
#> Loading required package: xml2

url <- "https://rotogrinders.com/lineups/nfl?site=draftkings"
myData <- read_html(url) 

myData %>%
  html_nodes(".teams") %>%
  html_text() %>%
  stringr::str_squish()
#>  [1] "New York NYJ Jets Cleveland CLE Browns"          
#>  [2] "New Orleans NOS Saints Atlanta ATL Falcons"      
#>  [3] "Buffalo BUF Bills Minnesota MIN Vikings"         
#>  [4] "Denver DEN Broncos Baltimore BAL Ravens"         
#>  [5] "Indianapolis IND Colts Philadelphia PHI Eagles"  
#>  [6] "Cincinnati CIN Bengals Carolina CAR Panthers"    
#>  [7] "San Francisco SFO 49ers Kansas City KCC Chiefs"  
#>  [8] "Green Bay GBP Packers Washington WAS Redskins"   
#>  [9] "Oakland OAK Raiders Miami MIA Dolphins"          
#> [10] "New York NYG Giants Houston HOU Texans"          
#> [11] "Tennessee TEN Titans Jacksonville JAC Jaguars"   
#> [12] "Los Angeles LAC Chargers Los Angeles LAR Rams"   
#> [13] "Chicago CHI Bears Arizona ARI Cardinals"         
#> [14] "Dallas DAL Cowboys Seattle SEA Seahawks"         
#> [15] "New England NEP Patriots Detroit DET Lions"      
#> [16] "Pittsburgh PIT Steelers Tampa Bay TBB Buccaneers"

Created on 2018-09-22 by the reprex package (v0.2.1)
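As a rough illustration of "filter it yourself", here is a minimal base R sketch. It works on a few of the strings copied from the output above instead of hitting the site again. The helper filter_games and the choice of team abbreviations are hypothetical; which teams belong to which slate is not stated in this answer and would have to come from the page data.

```r
# A few of the strings returned by the scrape above (copied from its output).
games <- c(
  "New York NYJ Jets Cleveland CLE Browns",
  "Dallas DAL Cowboys Seattle SEA Seahawks",
  "New England NEP Patriots Detroit DET Lions"
)

# Hypothetical helper: keep every game in which at least one of the given
# team abbreviations appears as a whole word.
filter_games <- function(games, abbreviations) {
  pattern <- paste0("\\b(", paste(abbreviations, collapse = "|"), ")\\b")
  games[grepl(pattern, games, perl = TRUE)]
}

# Assumed example selection; not taken from the site's slate definitions.
selected <- filter_games(games, c("DAL", "NEP"))
print(selected)
```

Matching on the abbreviation as a whole word avoids accidental hits inside city or team names.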

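Edit: You can still get all of the relevant information through read_html(). You need to take the IDs from the dropdown and then parse the JavaScript string that holds all of the salaries. I did the first part; the rest is up to you ;-)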

library(tidyverse, quietly = TRUE)
library(rvest, warn.conflicts = FALSE)
#> Loading required package: xml2

url <- "https://rotogrinders.com/lineups/nfl?site=draftkings"
raw <- read_html(url) 

# helper function
parse_json <- function(x) tibble(name = x$name, importID = x$importId)

# get id from slates
raw %>%
  html_nodes(".slate-data") %>%
  html_attr(name = "value") %>%
  jsonlite::fromJSON() %>%
  purrr::map_df(parse_json)
#> # A tibble: 10 x 2
#>    name                                                importID
#>    <chr>                                               <chr>   
#>  1 1:00pm: Classic: 13 Games                           21505   
#>  2 8:20pm: Classic (Thu-Mon): 16 Games                 21576   
#>  3 1:00pm: Classic (Sun-Mon): 15 Games                 21586   
#>  4 1:00pm: Tiers (NFL Tiers): 14 Games                 21589   
#>  5 1:00pm: Classic (Early Only): 10 Games              21581   
#>  6 4:05pm: Classic (Afternoon Only): 3 Games           21630   
#>  7 4:25pm: Classic (Afternoon Turbo): 2 Games          21631   
#>  8 8:20pm: Classic (Primetime): 2 Games                21645   
#>  9 4:25pm: Showdown Captain Mode (DAL vs SEA): 1 Games 21632   
#> 10 8:20pm: Showdown Captain Mode (NE vs DET): 1 Games  21644

raw %>%
  html_nodes(".select") %>%
  html_nodes("script") %>%
  html_text() %>%
  stringr::str_squish() %>%
  substr(1, 1000)
#> [1] "window.slateSelect = window.createReactComponent(SlateSelectRadnor, { slates: {\"All Games\":{\"games\":[{\"scheduleId\":\"45755\",\"teamAwayId\":\"12\",\"teamHomeId\":\"3\"},{\"scheduleId\":\"45756\",\"teamAwayId\":\"23\",\"teamHomeId\":\"21\"},{\"scheduleId\":\"45757\",\"teamAwayId\":\"9\",\"teamHomeId\":\"8\"},{\"scheduleId\":\"45758\",\"teamAwayId\":\"25\",\"teamHomeId\":\"1\"},{\"scheduleId\":\"45759\",\"teamAwayId\":\"14\",\"teamHomeId\":\"19\"},{\"scheduleId\":\"45760\",\"teamAwayId\":\"2\",\"teamHomeId\":\"22\"},{\"scheduleId\":\"45761\",\"teamAwayId\":\"31\",\"teamHomeId\":\"26\"},{\"scheduleId\":\"45762\",\"teamAwayId\":\"7\",\"teamHomeId\":\"20\"},{\"scheduleId\":\"45763\",\"teamAwayId\":\"27\",\"teamHomeId\":\"10\"},{\"scheduleId\":\"45764\",\"teamAwayId\":\"18\",\"teamHomeId\":\"13\"},{\"scheduleId\":\"45765\",\"teamAwayId\":\"16\",\"teamHomeId\":\"15\"},{\"scheduleId\":\"45766\",\"teamAwayId\":\"28\",\"teamHomeId\":\"30\"},{\"scheduleId\":\"45767\",\"teamAwayId\":\"5\",\"teamHomeId\":\"29\"},{\"scheduleId\":\"45768\",\"teamAwayId\":\"17\",\"teamHomeId\":\"32\"},{\"scheduleId\":\"45769\",\"teamAwayId\":\"11\",\"teamHomeId\":\"6\"},{\"scheduleId\":\"45770\","

Created on 2018-09-23 by the reprex package (v0.2.1)
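One possible way to approach the remaining part is to cut the JSON object out of the script text and hand it to jsonlite. The sketch below is only that, a sketch: the helper extract_json_object is hypothetical, the sample string is a shortened stand-in for the real script (its closing part is an assumption, since the output above is truncated at 1000 characters), and the real payload may need extra handling.

```r
# Hypothetical helper: return the first balanced {...} object that follows
# `marker` in `text`, skipping over braces that sit inside JSON strings.
extract_json_object <- function(text, marker) {
  start <- regexpr(marker, text, fixed = TRUE)
  if (start == -1) stop("marker not found")
  chars <- strsplit(substring(text, start + nchar(marker)), "")[[1]]
  open <- which(chars == "{")[1]
  if (is.na(open)) stop("no opening brace after marker")
  depth <- 0
  in_string <- FALSE
  escaped <- FALSE
  for (i in seq(open, length(chars))) {
    ch <- chars[i]
    if (in_string) {
      if (escaped) {
        escaped <- FALSE
      } else if (ch == "\\") {
        escaped <- TRUE
      } else if (ch == "\"") {
        in_string <- FALSE
      }
    } else if (ch == "\"") {
      in_string <- TRUE
    } else if (ch == "{") {
      depth <- depth + 1
    } else if (ch == "}") {
      depth <- depth - 1
      if (depth == 0) return(paste(chars[open:i], collapse = ""))
    }
  }
  stop("unbalanced braces")
}

# Shortened stand-in for the script text; the tail is an assumed shape.
script <- paste0(
  "window.slateSelect = window.createReactComponent(SlateSelectRadnor, ",
  "{ slates: {\"All Games\":{\"games\":[",
  "{\"scheduleId\":\"45755\",\"teamAwayId\":\"12\",\"teamHomeId\":\"3\"},",
  "{\"scheduleId\":\"45756\",\"teamAwayId\":\"23\",\"teamHomeId\":\"21\"}",
  "]}} });"
)

json_text <- extract_json_object(script, "slates: ")

# Parse only if jsonlite is installed; the answer above already relies on it.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  slates <- jsonlite::fromJSON(json_text)
  print(slates[["All Games"]]$games)
}
```

Counting braces instead of using a single regular expression keeps the extraction correct when the object is nested, which the slates payload is.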