Question

I am trying to scrape the Times Higher Education rankings.

I used the following code, but the result is an empty table. What am I doing wrong?

pacman::p_load(rvest)

webpage <- read_html(paste0('https://www.timeshighereducation.com/rankings/', 
                            'united-states/2018#!/page/0/length/-1/sort_by/', 
                            'stats_salary/sort_order/desc/cols/stats'))


d <- html_nodes(webpage, xpath = '//table') %>% 
  html_table()

d

[[1]]
 [1] rank order           Rank                  Name                  Node ID              
 [5] Overall                                     Resources                                  
 [9] Engagement                                  Outcomes                                   
[13] Environment                                                                            
[17]                                                                                        
[21] Tuition and Fees      Room and Board        Salary after 10 years
<0 rows> (or 0-length row.names)

Answer 1

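I found the data! It turns out that timeshighereducation.com uses JavaScript to load the data, so the typical rvest routine does not work: the table in the downloaded HTML contains only the header row.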

I found the following link useful for seeing how to scrape web pages that use JavaScript to display their data: rvest and V8

My first step was to see which node returned the script I wanted. It appeared to be number 9 in the list. I then converted it to HTML text.

# Take the ninth <script> node on the page and extract its text
t <- html_nodes(webpage, 'script') %>% 
  `[`(9) %>% 
  html_text()
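
Picking the script by position is brittle, because the index shifts whenever the page adds or removes a script tag. Here is a minimal sketch of selecting the script by its content instead. The inline HTML snippet, the `.json` search string and the variable names are assumptions made up for illustration, not taken from the real page.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page; the real page has many more scripts.
sample_page <- read_html('
  <html><body>
    <script>var a = 1;</script>
    <script>var settings = {"url": "https://example.com/data/rankings.json"};</script>
  </body></html>')

# Extract the text of every <script> node
scripts <- html_text(html_nodes(sample_page, "script"))

# Keep only the scripts that mention a .json file
json_scripts <- scripts[grepl(".json", scripts, fixed = TRUE)]
json_scripts
```

On the real page, the same filter could be applied to `html_nodes(webpage, 'script')` in place of the hard-coded index.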

After inspecting the HTML text further, I found that the script references a JSON file. If I enter that file's URL in Chrome, I can actually see the data.
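
Rather than reading through the script text by eye, the JSON URL could be pulled out with a regular expression. This is only a sketch in base R: the sample script text and the pattern are assumptions, and the real script may escape slashes as `\/`, in which case the pattern would need adjusting.

```r
# Hypothetical script text; the real one is much longer.
script_text <- 'var settings = {"source": "https://example.com/files/rankings_2018.json", "length": 25};'

# Match anything that looks like an http(s) URL ending in .json
pattern <- "https?://[^\"' ]+\\.json"
json_urls <- regmatches(script_text, gregexpr(pattern, script_text))[[1]]
json_urls
```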

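So it seemed easy to get the data with one of the many R packages available for handling JSON. I chose jsonlite. It really was easy: only 5 lines of code to get the data. I am happy now :)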

library(jsonlite)
college_json <- fromJSON(paste0(
  'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/', 
  'united_states_rankings_2018_limit0_efdb24148bae97278bbfe6ecfd71cdd9.json'))

college_dat <- college_json$data
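
To see why `$data` comes back as a table, it may help to try `fromJSON()` on a small inline payload. The sketch below assumes a structure similar to the rankings file (a top-level `data` array of objects). The field names and values are invented for illustration.

```r
library(jsonlite)

# Hypothetical payload; the field names and values are made up.
sample_json <- '{"data": [
  {"rank": "1", "name": "College A", "stats_salary": "$90,000"},
  {"rank": "2", "name": "College B", "stats_salary": "$85,000"}
]}'

parsed <- fromJSON(sample_json)

# fromJSON() simplifies an array of objects into a data frame by default
sample_dat <- parsed$data
str(sample_dat)
```

If the real file follows the same layout, `college_dat` should likewise be a data frame with one row per institution.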
