How to scrape a table from a website in R

Time: 2021-03-14 12:33:22

Tags: r web-scraping rvest

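I want to extract the bottom table ("Daily Observations") from https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1. I copied the full XPath of the table component, but the output is {xml_nodeset (0)}. What am I doing wrong here? I used the following code: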

library(rvest)
single <- read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')  
single %>%
  html_nodes(xpath = '/html/body/app-root/app-history/one-column-layout/wu-header/sidenav/mat-sidenav-container/mat-sidenav-content/div/section/div[2]/div/div[5]/div/div/lib-city-history-observation/div/div[2]/table')

It looks like the table component is empty.

1 answer:

Answer 0: (score: 2)

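This is a dynamic page: the table is generated by JavaScript in the browser, so rvest alone is not enough. You can still get the source data from the JSON API, though.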
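To see why the XPath from the question comes back empty, here is a minimal sketch that needs no network access. The markup is a made-up stand-in, not wunderground's actual HTML: its table is created by a script, and read_html() never runs that script.

```r
library(rvest)

# Hypothetical stand-in page: the <table> only comes into existence
# when a browser executes the script.
static_html <- '
<html><body>
  <div id="observations"></div>
  <script>
    var t = document.createElement("table");
    document.getElementById("observations").appendChild(t);
  </script>
</body></html>'

page <- read_html(static_html)

# read_html() parses the markup as delivered and does not run JavaScript,
# so the parsed document contains no <table> node at all.
tables <- html_nodes(page, "table")
length(tables)

# The script is still present as plain text, which is why values embedded
# in a <script> node can be read with html_text().
script_text <- html_text(html_node(page, "script"))
grepl("createElement", script_text, fixed = TRUE)
```

An empty nodeset therefore does not necessarily mean the XPath is wrong; the node may simply not exist in the HTML the server sends.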

library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)

# Read static html. It won't create the table, but it holds the API key
# we need to retrieve the source JSON.

htm_obj <- 
  read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')

# Retrieve the API key. This key is stored in a node with javascript content.
str_apikey <- 
  html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
  html_text() %>% gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", . )

# Create a URI pointing to the API, with the API key as the first key-value pair of the query
url_apijson <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110101")
# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)

# Wrangle the JSON's contents into the table you need
tbl_daily <- 
  json_obj$observations %>% as_tibble() %>% 
  mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>% 
                          with_tz("America/New_York")) %>% # The time zone this airport (KDCA) is located in.
  select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal, gust, pressure, precip_hrly) # The equivalent variables of your html table

Result: a nice table

# A tibble: 34 x 8
   valid_time_gmt       temp dewPt    rh wdir_cardinal gust  pressure precip_hrly
   <dttm>              <int> <int> <int> <chr>         <lgl>    <dbl>       <dbl>
 1 2010-12-31 23:52:00    38    NA    79 CALM          NA        30.1          NA
 2 2011-01-01 00:52:00    35    31    85 CALM          NA        30.1          NA
 3 2011-01-01 01:52:00    36    31    82 CALM          NA        30.1          NA
 4 2011-01-01 02:52:00    37    31    79 CALM          NA        30.1          NA
 5 2011-01-01 03:52:00    36    30    79 CALM          NA        30.1          NA
 6 2011-01-01 04:52:00    37    30    76 NNE           NA        30.1          NA
 7 2011-01-01 05:52:00    36    30    79 CALM          NA        30.1          NA
 8 2011-01-01 06:52:00    34    30    85 CALM          NA        30.1          NA
 9 2011-01-01 07:52:00    37    31    79 CALM          NA        30.1          NA
10 2011-01-01 08:52:00    44    38    79 CALM          NA        30.1          NA
# ... with 24 more rows
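
If you need more than one day, the key extraction and the URL construction can be wrapped in small helpers. This is only a sketch: the function names extract_api_key and build_history_url are hypothetical, the sample state string is invented so the code runs offline, and it assumes the API accepts other dates in the same startDate/endDate format used in the code above. Only the location code KDCA:9:US comes from the answer.

```r
# Hypothetical helper: pull the value stored under SUN_API_KEY out of the
# text of the app-root-state script node (same regular expression as above).
extract_api_key <- function(state_text) {
  gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", state_text)
}

# Hypothetical helper: build the historical-observations URL for one day.
# The endpoint and query parameters are copied from the code above.
build_history_url <- function(api_key, date, location = "KDCA:9:US") {
  day <- format(as.Date(date), "%Y%m%d")
  paste0(
    "https://api.weather.com/v1/location/", location,
    "/observations/historical.json?apiKey=", api_key,
    "&units=e&startDate=", day, "&endDate=", day
  )
}

# Invented sample text, standing in for the real script contents.
sample_state <- "{&q;process.env&q;:{&q;SUN_API_KEY&q;:&q;abc123&q;,&q;OTHER&q;:&q;x&q;}}"

key <- extract_api_key(sample_state)
url <- build_history_url(key, "2011-01-01")
print(url)
```

With a real key, the resulting URL can be passed to fromJSON() exactly as in the code above, for example inside lapply() over a vector of dates.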