Extracting meta descriptions from web pages using R

Posted: 2016-06-17 01:00:23

Tags: r rvest httr

Hello, I am trying to retrieve the meta descriptions of these web pages from their page source:

Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html", 
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html"))

Desired output:

Data$Meta_Description<-data.frame(Extracted=c(
"Sanford Wallace gets 2.5 years in prison for 27 million Facebook", 
"OMG, this Japanese Trump Commercial is everything",
"Omar Mateen posted to Facebook during Orlando mass shooting"))

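I tried to do this with httr, but I can neither get the result into the desired output format nor extract the description from the content retrieved by the GET call: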

library(httr)
resp <- GET("http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html")
str(resp)
List of 10
$ url        : chr "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html"
$ status_code: int 200
$ headers    :List of 22
..$ server                     : chr "Apache/2.2"

The field I need to extract from the page source comes right after this string:
<meta itemprop="description" content="

Like this:

<meta itemprop="description" content="&#039;Spam King&#039; 
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages" 

1 Answer:

Answer 0 (score: 6)

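You really only need rvest. Since the descriptions are all <h1> headlines, you can iterate over the list of URLs and pick out the headline from each page: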

library(rvest)

sapply(Data$Pages, 
       function(url){
           url %>% 
               as.character() %>%   # in case strings are stored as factors
               read_html() %>% 
               html_nodes('h1') %>% 
               html_text()
           })

# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"                                         
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting" 

Or, if you really want to scrape the <meta> tags, you can do it the same way, although the selector is more painful:

sapply(Data$Pages, function(url){
    url %>% 
        as.character() %>% 
        read_html() %>% 
        html_nodes(xpath = '//meta[@itemprop="description"]') %>% 
        html_attr('content')
    })

Either way, you get the same result.
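To get the result into a data frame column, as the question asks, something along these lines may help. This is only a sketch under a few assumptions: the `pages` vector holds made-up inline HTML standing in for the downloaded pages (so it runs without network access), and `extract_meta_description` and `Result` are hypothetical names, not part of rvest. It uses a CSS attribute selector in place of the XPath, takes the first match with `html_node()`, and uses `vapply()` with `USE.NAMES = FALSE` because `sapply()` over a character vector would name each element after its URL.

```r
library(rvest)

# Hypothetical stand-ins for downloaded page sources, keyed by made-up URLs.
pages <- c(
  "http://example.com/a.html" = '<html><head><meta itemprop="description" content="&#039;Spam King&#039; Sanford Wallace gets 2.5 years in prison"></head><body><h1>Headline A</h1></body></html>',
  "http://example.com/b.html" = '<html><head><title>No description here</title></head><body><h1>Headline B</h1></body></html>'
)

# Hypothetical helper: returns the content attribute of the first
# <meta itemprop="description"> tag, or NA if the page has no such tag.
extract_meta_description <- function(html) {
  html %>%
    read_html() %>%                                # accepts a URL or literal HTML
    html_node('meta[itemprop="description"]') %>%  # CSS equivalent of the XPath
    html_attr("content")                           # entities such as &#039; are decoded
}

# vapply() guarantees exactly one string per page; USE.NAMES = FALSE
# keeps the URLs from being attached as element names.
Result <- data.frame(Pages = names(pages), stringsAsFactors = FALSE)
Result$Meta_Description <- vapply(pages, extract_meta_description,
                                  character(1), USE.NAMES = FALSE)
```

Since `read_html()` also accepts URLs, the same helper should work on `as.character(Data$Pages)` directly. Pages without the tag come back as `NA` instead of breaking the column length.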