Question

This follows up on the answer to: how to get information within <meta name...> tag in html using htmlParse and xpathSApply

My problem:

html <- htmlParse(domain, useInternalNodes=TRUE)
names <- html['//meta/@name']
content <- html['//meta/@content']

cbind(names, content)

The meta tags in the page are:

<meta name="description" content="blah, blah...." />
<meta name="keywords" content="keyword1, keyword2" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="google-site-verification" content="1234jalsdkfjasdf928374-293423" />

What I found is:

 length(names)
[1] 3

length(content)
[1] 4

names                                     content
[1, ] "description"                       [1, ] "blah, blah...."
[2, ] "keywords"                          [2, ] "keyword1, keyword2"
[3, ] "google-site-verification"          [3, ] "text/html; charset=UTF-8"
[4, ] "description"                       [4, ] "1234jalsdkfjasdf928374-293423"

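It seems the parser is tripping over "http-equiv": it skips ahead and returns the name from the next line, "google-site-verification", but still returns the "content" for "http-equiv". Then, because there are no more names, cbind loops back to "description" to pair with the last content value, which actually belongs to "google-site-verification". It looks like a simple fix, but no conditional I have tried so far works. How can I do this?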
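To make the recycling visible without any parsing, here is a minimal sketch with the three names and four contents typed in by hand (the values are copied from the output above; `paired` is just an illustrative name):

```r
# Values copied by hand from the output above: 3 names, 4 contents.
names <- c("description", "keywords", "google-site-verification")
content <- c("blah, blah....",
             "keyword1, keyword2",
             "text/html; charset=UTF-8",
             "1234jalsdkfjasdf928374-293423")

# cbind() recycles the shorter vector to fill the extra row and warns
# that the lengths do not divide evenly; the warning is silenced here.
paired <- suppressWarnings(cbind(names, content))

# Row 4 pairs the recycled first name with the last content value.
print(paired)
```

So the parser is not really skipping anything: the name query simply has nothing to return for the http-equiv tag, and cbind fills the gap by starting over.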

Answer 1

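I realize you figured out what you needed (which didn't quite match the original question), but we'll use StackOverflow.com as an example, since I had coded it up anyway as an addition to my original answer: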

library(XML)

doc <- htmlParse("http://stackoverflow.com/", useInternalNodes=TRUE)

which has the following <meta> tags:

<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta property="og:type" content="website" />
<meta property="og:image" itemprop="image primaryImageOfPage" content="http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6" />
<meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow" />
<meta name="twitter:description" property="og:description" itemprop="description" content="Q&amp;A for professional and enthusiast programmers" />
<meta property="og:url" content="http://stackoverflow.com/"/>

Not every tag has a name attribute. In fact, of the 7 tags, only 4 do:

length(doc["//meta/@name"])
## [1] 4

Note that this is the same as doing:

length(xpathSApply(doc, "//meta/@name"))
## [1] 4

which is pretty much what is happening behind the scenes.

The search only returns the attributes that actually exist. You can see more of the layout if you do:

xpathSApply(doc, "//meta", xmlGetAttr, "name")

## [[1]]
## [1] "twitter:card"
## 
## [[2]]
## [1] "twitter:domain"
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## [1] "twitter:title"
## 
## [[6]]
## [1] "twitter:description"
## 
## [[7]]
## NULL

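That list, when converted to a vector, gets truncated to 4 entries because of the NULLs. rvest (the original answer) is just "smarter" about the extraction.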
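If you want to stay with the XML package, one way to keep names and contents aligned is to walk the <meta> nodes once and read both attributes from each node, substituting NA where an attribute is missing. A minimal sketch, assuming the four tags from the question are pasted in as a string (`html_text` and `meta` are names made up for illustration):

```r
library(XML)

# The four tags from the question, wrapped in a minimal page (assumed input).
html_text <- paste(
  '<html><head>',
  '<meta name="description" content="blah, blah...." />',
  '<meta name="keywords" content="keyword1, keyword2" />',
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />',
  '<meta name="google-site-verification" content="1234jalsdkfjasdf928374-293423" />',
  '</head><body></body></html>',
  sep = "\n"
)
doc <- htmlParse(html_text, asText = TRUE)

# Select the nodes themselves, not their attributes, so that every
# <meta> tag contributes exactly one row.
nodes <- getNodeSet(doc, "//meta")

# xmlGetAttr() returns `default` when the attribute is absent, so a
# missing name becomes NA instead of being dropped.
meta <- data.frame(
  name    = sapply(nodes, xmlGetAttr, "name",    default = NA_character_),
  content = sapply(nodes, xmlGetAttr, "content", default = NA_character_),
  stringsAsFactors = FALSE
)

print(meta)
```

Because both columns come from the same node list, they always have the same length, and the http-equiv row simply shows NA in the name column.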

Original answer

With rvest, you can very quickly capture all the <meta> attributes into a data frame (if that's what you're trying to do):

library(rvest)
library(dplyr)

# read_html() replaces the older html(), which newer rvest releases no longer provide
pg <- read_html("http://facebook.com/")

all_meta_attrs <- unique(unlist(lapply(lapply(pg %>% html_nodes("meta"), html_attrs), names)))

dat <- data.frame(lapply(all_meta_attrs, function(x) {
  pg %>% html_nodes("meta") %>% html_attr(x)
}))

colnames(dat) <- all_meta_attrs

glimpse(dat)

## Observations: 19
## Variables:
## $ charset    (fctr) utf-8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ http-equiv (fctr) NA, refresh, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ content    (fctr) NA, 0; URL=/?_fb_noscript=1, default, Facebook, h...
## $ name       (fctr) NA, NA, referrer, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ id         (fctr) NA, NA, meta_referrer, NA, NA, NA, NA, NA, NA, NA...
## $ property   (fctr) NA, NA, NA, og:site_name, og:url, og:image, og:lo...

But it will also reliably extract individual attributes for you:

pg %>% html_nodes("meta") %>% html_attr("http-equiv")

##  [1] NA                "refresh"         NA               
##  [4] NA                NA                NA               
##  [7] NA                NA                NA               
## [10] NA                NA                NA               
## [13] NA                NA                NA               
## [16] NA                NA                NA               
## [19] "X-Frame-Options"

Answer 2

So I figured it out, or at least what I was ultimately trying to do. In the end, I needed to extract the "keywords" and the "description". The piece of code that needed to change went from this...

html <- htmlParse(domain, useInternalNodes=TRUE)
names <- html['//meta/@name']
content <- html['//meta/@content']

to this...

html <- htmlParse(domain, useInternalNodes=TRUE)
keywords <- html['//meta[@name="keywords"]/@content']
description <- html['//meta[@name="description"]/@content']

Cheers
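For anyone who wants to try the predicate approach without fetching a page, here is a small sketch against the tags from the question pasted in as a string. `html_text` is an illustrative name, and selecting the node and then reading the attribute with xmlGetAttr is a variation on the code above that returns plain character values, not a required change:

```r
library(XML)

# Tags from the question, wrapped in a minimal page (assumed input).
html_text <- paste(
  '<html><head>',
  '<meta name="description" content="blah, blah...." />',
  '<meta name="keywords" content="keyword1, keyword2" />',
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />',
  '</head><body></body></html>',
  sep = "\n"
)
doc <- htmlParse(html_text, asText = TRUE)

# The [@name="..."] predicate picks only the tag with that name, so the
# tag that has http-equiv instead of name is never considered.
keywords    <- xpathSApply(doc, '//meta[@name="keywords"]',    xmlGetAttr, "content")
description <- xpathSApply(doc, '//meta[@name="description"]', xmlGetAttr, "content")

print(keywords)
print(description)
```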

Parsing meta name/content using XML and R
