XPath: select the grandparent and one specific uncle node

Date: 2014-10-31 14:19:26

Tags: xml r xpath

I am using XPath in R and have an XML structure like this:

library(XML)

xml1 <- xmlParse('
<L0>
    <L1>
        <ID>Get this ID</ID>
        <L1N1>Ignore node 1</L1N1>
        <L1N2>
            <L2>
                <L2N1>Get this node and all others in L2</L2N1>
            </L2>
        </L1N2>
        <L1N3>Ignore node 3</L1N3>
    </L1>
    <L1>
        <ID>Get this ID</ID>
        <L1N1>Ignore node 1</L1N1>
        <L1N2>
            <L2>
                <L2N1>Get this node and all others in L2</L2N1>
            </L2>
        </L1N2>
        <L1N4>Ignore node 4</L1N4>
    </L1>
    <L1>
        <ID>Ignore this ID</ID>
        <L1N1>Ignore node 1</L1N1>
        <L1N3>Ignore node 3</L1N3>
        <L1N4>Ignore node 4</L1N4>
    </L1>
</L0>
                 ')

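I want to extract every L2 node together with one uncle node (for example ID), but none of the other uncles. Each extracted result should be returned inside its grandparent node L1. This is the desired output: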

## [[1]]
## <L1>
##   <ID>Get this ID</ID>
##   <L1N2>
##     <L2>
##       <L2N1>Get this node and all others in L2</L2N1>
##     </L2>
##   </L1N2>
## </L1> 

## [[2]]
## <L1>
##   <ID>Get this ID</ID>
##   <L1N2>
##     <L2>
##       <L2N1>Get this node and all others in L2</L2N1>
##     </L2>
##   </L1N2>
## </L1>

I can get the L1 nodes that have an L2 descendant:

getNodeSet(xml1, "//L1[descendant::L2]")
## [[1]]
## <L1>
##   <ID>Get this ID</ID>
##   <L1N1>Ignore node 1</L1N1> ## *Want to exclude this*
##   <L1N2>
##     <L2>
##       <L2N1>Get this node and all others in L2</L2N1>
##     </L2>
##   </L1N2>
##   <L1N3>Ignore node 3</L1N3> ## *Want to exclude this*
## </L1> 
## 
## [[2]]
## <L1>
##   <ID>Get this ID</ID>
##   <L1N1>Ignore node 1</L1N1> ## *Want to exclude this*
##   <L1N2>
##     <L2>
##       <L2N1>Get this node and all others in L2</L2N1>
##     </L2>
##   </L1N2>
##   <L1N4>Ignore node 4</L1N4> ## *Want to exclude this*
## </L1>

...but this includes the uncles I don't want. I can exclude those uncles and select only the L1 children I want:

getNodeSet(xml1, "//L1/*[self::ID | child::L2]")
## [[1]]
## <ID>Get this ID</ID> 
##   
## [[2]]
## <L1N2>
##   <L2>
##     <L2N1>Get this node and all others in L2</L2N1>
##   </L2>
## </L1N2> 
## 
## [[3]]
## <ID>Get this ID</ID> 
##   
## [[4]]
## <L1N2>
##   <L2>
##     <L2N1>Get this node and all others in L2</L2N1>
##   </L2>
## </L1N2> 
## 
## [[5]]
## <ID>Ignore this ID</ID>

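...but now ID and L2 are returned separately instead of inside L1, and the result also includes an element from the third L1 node, which has no L2.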
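To illustrate the grouping problem, here is a minimal sketch (not part of the original post) that first selects the qualifying L1 nodes and then evaluates a relative XPath on each of them, so that ID and the L2 branch stay together per L1. The shortened document `doc` and the variable names are made up for illustration. The result is a list of node lists, not the re-wrapped L1 elements asked for above.

```r
library(XML)

# Shortened, hypothetical stand-in for xml1: one L1 with an L2, one without.
doc <- xmlParse('<L0>
  <L1><ID>A</ID><L1N1>drop</L1N1><L1N2><L2><L2N1>keep</L2N1></L2></L1N2></L1>
  <L1><ID>B</ID><L1N1>drop</L1N1></L1>
</L0>')

# Step 1: only the L1 nodes that contain an L2 somewhere below them.
parents <- getNodeSet(doc, "//L1[descendant::L2]")

# Step 2: for each such L1, a relative XPath (note the leading "./")
# picks the ID child and any child that has an L2 child.
grouped <- lapply(parents, function(l1) {
  getNodeSet(l1, "./*[self::ID or child::L2]")
})

# Names of the children kept for the first qualifying L1.
sapply(grouped[[1]], xmlName)
```

Because the second query runs relative to each L1, the third kind of L1 (without an L2) never contributes its ID.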

Can XPath return the desired result? If not, is there something else in R I can use to get it?

1 Answer:

Answer 0 (score: 1)

This seems to do what you want (using your xml1):

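# Drop every child of an L1 node except ID and L1N2.
# Note: removeChildren() changes the parsed document (xml1) in place.
trim <- function(node) {
  child.names <- names(node)
  to.remove   <- child.names[!(child.names %in% c("ID","L1N2"))]
  removeChildren(node,kids=to.remove)
}
lapply(xml1["//L1[descendant::L2]"],trim)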
#  [[1]]
# <L1>
#   <ID>Get this ID</ID>
#   <L1N2>
#     <L2>
#       <L2N1>Get this node and all others in L2</L2N1>
#     </L2>
#   </L1N2>
# </L1> 
# 
# [[2]]
# <L1>
#   <ID>Get this ID</ID>
#   <L1N2>
#     <L2>
#       <L2N1>Get this node and all others in L2</L2N1>
#     </L2>
#   </L1N2>
# </L1> 

Of course, you could use an anonymous function and put it all on one line:

lapply(xml1["//L1[descendant::L2]"],function(node) removeChildren(node,kids=names(node)[!(names(node)%in%c("ID","L1N2"))]))
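Since removing children edits the parsed tree itself, a variant that leaves the original document untouched may be useful. The following is a minimal sketch under stated assumptions, not the answerer's method: it copies the document with `xmlClone()`, uses XPath to find the unwanted children of each qualifying L1, and unlinks them with `removeNodes()`. The shortened document `doc` and the names `work`, `unwanted` and `trimmed` are made up for illustration.

```r
library(XML)

# Shortened, hypothetical stand-in for xml1.
doc <- xmlParse('<L0>
  <L1><ID>A</ID><L1N1>drop</L1N1><L1N2><L2><L2N1>keep</L2N1></L2></L1N2><L1N3>drop</L1N3></L1>
  <L1><ID>B</ID><L1N1>drop</L1N1></L1>
</L0>')

# Work on a copy so that doc itself is not modified.
work <- xmlClone(doc)

# Children of a qualifying L1 that are neither ID nor the parent of an L2.
unwanted <- getNodeSet(
  work, "//L1[descendant::L2]/*[not(self::ID or child::L2)]"
)

# Unlink each unwanted node from the copied tree.
invisible(lapply(unwanted, removeNodes))

# The trimmed L1 nodes, each still wrapping its ID and L2 branch.
trimmed <- getNodeSet(work, "//L1[descendant::L2]")
trimmed
```

Here the decision about what to keep lives in the XPath predicate rather than in a vector of child names, so L1N2 does not have to be named explicitly; any child that directly contains an L2 is kept.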