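XPath: all nodes until a node (Wikiquote.org)

Asked: 2012-12-16 16:33:52

Tags: ruby xpath nokogiri xpath-2.0 wikimedia

Document: http://en.wikiquote.org/wiki/The_Matrix

I want to get all the quotes (//ul/li) in the first section (Neo's quotes).

I can't simply use //ul[1]/li, because on some Wikiquote pages the quotes are marked up like this: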

<h2><span class="mw-headline" id="Neo">Neo</span></h2>  

<ul>
 <li> First quote </li>
</ul> 

<ul>
 <li> Second quote </li>
</ul> 

<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>  

instead of

<ul>
     <li> First quote </li>
     <li> Second quote </li>
</ul>

I tried this to get the first section:

(//*[@id='mw-content-text']/ul/preceding-sibling::h2/span[@class='mw-headline'])[1]

But I am having trouble getting only the quotes of the first section. Can you help me?

3 answers:

Answer 0 (score: 2)

Use:

(//h2[span/@id='Neo'])[1]/following-sibling::ul
  [count(.
        |
         (//h2[span/@id='Neo'])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  =
   count((//h2[span/@id='Neo'])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  ]
   /li

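This selects all li elements in the ul lists that follow the first h2 whose span child has an id attribute with value "Neo", up to the next h2.

To select the quotations for the second such h2, just replace 1 with 2 in the expression above.

Do this for every number: 1, 2, ..., count(//h2[span/@id='Neo'])

XSLT-based verification: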

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "(//h2[span/@id='Neo'])[1]/following-sibling::ul
      [count(.
            |
             (//h2[span/@id='Neo'])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      =
       count((//h2[span/@id='Neo'])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      ]
        /li

   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<html>
 <h2><span class="mw-headline" id="Neo">Neo</span></h2>

 <ul>
  <li> First quote </li>
 </ul>

 <ul>
  <li> Second quote </li>
 </ul>

 <h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<li> First quote </li>
<li> Second quote </li>

Explanation:

Here we use the Kayessian formula (by Dr. Michael Kay) for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]

The above selects all nodes that belong both to the node-set $ns1 and to the node-set $ns2.

In this case, we substitute $ns1 with the node-set consisting of all following-sibling ul elements of the h2 of interest. We substitute $ns2 with the node-set consisting of all preceding-sibling ul elements of the h2 that is the immediate (first) following sibling of the h2 of interest.

The intersection of these two node-sets contains exactly the wanted ul elements.
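To see why the formula yields an intersection, it can help to model the node-sets as plain Ruby arrays. This is only an analogy, assuming made-up node labels (ul1, ul2, ul3); Array#| plays the role of the XPath union operator, which also drops duplicates.

```ruby
# Node-sets modeled as arrays of unique labels (hypothetical names).
ns1 = %w[ul1 ul2 ul3] # all following-sibling uls of the h2 of interest
ns2 = %w[ul1 ul2]     # all preceding-sibling uls of the next h2

# $ns1[count(.|$ns2) = count($ns2)]
# A node is kept only if adding it to ns2 does not make ns2 any bigger,
# which is true exactly when the node is already a member of ns2.
intersection = ns1.select { |node| ([node] | ns2).size == ns2.size }

p intersection
```

Here ul3 is rejected because its union with ns2 has three members while ns2 has two.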


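Update: In a comment, the OP states that he only knows that he wants the results from the first section; the string "Neo" is not known in advance.

Here is the modified solution:

(//h2[span/@id=$vSectionId])[1]
            /following-sibling::ul
  [count(.
        |
         (//h2[span/@id=$vSectionId])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  =
   count((//h2[span/@id=$vSectionId])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  ]
   /li

The variable $vSectionId must be obtained as the string value of the following XPath expression:

substring(//div[h2='Contents']
              /following-sibling::ul[1]
                 /li[1]/a/@href,
          2)

Here we obtain the wanted id from the href of the a in the first table-of-contents entry, skipping the first character "#".

Here is the XSLT-based verification: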

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:variable name="vSectionId" select=
 "substring(//div[h2='Contents']
                      /following-sibling::ul[1]
                         /li[1]/a/@href,
                    2)
 "/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "(//h2[span/@id=$vSectionId])[1]
                /following-sibling::ul
      [count(.
            |
             (//h2[span/@id=$vSectionId])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      =
       count((//h2[span/@id=$vSectionId])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      ]
        /li

   "/>
 </xsl:template>
</xsl:stylesheet>

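When this transformation is applied on the complete XML document at http://en.wikiquote.org/wiki/The_Matrix, the result of applying these two XPath expressions (substituting the result of the first into the second, then evaluating the second) is the wanted, correct one: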

<li>I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.</li>
<li>Whoa.</li>
<li>I know kung-fu.</li>
<li>Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.</li>
<li>Guns.. lots of guns...</li>
<li>There is no spoon.</li>
<li>My name...is Neo!</li>

Answer 1 (score: 2)

Using the API will make the parsing much easier. Here is a query that pulls out the first section:

http://en.wikiquote.org/w/api.php?action=parse&page=The_Matrix&section=1&prop=wikitext

Output:

<?xml version="1.0"?>
<api>
  <parse title="The Matrix">
    <wikitext xml:space="preserve">== Neo ==
[[File:The.Matrix.glmatrix.2.png|thumb|right|Unfortunately, no one can be ''told'' what The Matrix is. You have to see it for yourself.]]
[[Image:Arty spoon.jpg|thumb|right|Do not try to bend the spoon — that's impossible. Instead, only try to realize the truth: there is no spoon.]]

* I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.

* Whoa.
* I know kung-fu.

* Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.

* Guns.. lots of guns...

* There is no spoon. 

* My name...is Neo!</wikitext>
  </parse>
</api>

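Here is one way to parse this (using HTTParty):

require 'httparty'

# Thin wrapper around the Wikiquote MediaWiki API.
class Wikiquote
  include HTTParty
  base_uri 'en.wikiquote.org/w/'

  # Returns the bulleted quotes from section 1 of the given page as an array of strings.
  def self.get_quotes(page)
    url = "/api.php?action=parse&page=#{page}&section=1&prop=wikitext&format=xml"
    headers = {"User-Agent" => "Wikiquote scraper 1.0"}
    # HTTParty parses the XML response into nested hashes.
    content = get(url, headers: headers)['api']['parse']['wikitext']['__content__']
    # Each quote is a wikitext line that starts with "* ".
    content.scan(/^\* (.*)$/).flatten
  end
end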
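The extraction step inside get_quotes can be tried offline, without any network access. The sketch below runs the same regular expression against a shortened, hand-written wikitext sample (an assumption modeled on the API output shown earlier); the added strip call is an optional extra, not part of the answer's code.

```ruby
# Shortened wikitext sample; the real API response is longer.
wikitext = "== Neo ==\n" \
           "[[Image:Arty spoon.jpg|thumb|right|Do not try to bend the spoon]]\n" \
           "\n" \
           "* Whoa.\n" \
           "* I know kung-fu.\n" \
           "\n" \
           "* There is no spoon. \n"

# ^\* matches a literal asterisk at the start of a line, so the heading
# and the image line are skipped; (.*) captures the rest of the line.
raw_quotes = wikitext.scan(/^\* (.*)$/).flatten

# Optional cleanup: drop trailing spaces such as the one after "spoon.".
clean_quotes = raw_quotes.map(&:strip)

p clean_quotes
```

Without the cleanup, trailing whitespace from the wikitext survives in the result, as in "There is no spoon. ".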

Usage:

Wikiquote.get_quotes("The_Matrix")

Output:

["I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.",
 "Whoa.",
 "I know kung-fu.",
 "Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.",
 "Guns.. lots of guns...",
 "There is no spoon. ",
 "My name...is Neo!"]

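Answer 2 (score: 1)

I would suggest //ul[preceding-sibling::h2[1][span/@id = 'Neo']]/li. Or, if the id attribute is not relevant to the search either, then going by your comments on the other answer I think you want: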

(//h2[span[contains(@class, 'mw-headline')]])[1]/following-sibling::ul
   [1 = count(preceding-sibling::h2[1] | (//h2[span[contains(@class, 'mw-headline')]])[1])]/li

See "XPath axis, get all following nodes until" for an explanation. I hope I have managed to close all the parentheses and brackets correctly; I have no time to test right now.