XPath query to find all untagged text in an entire HTML document

Date: 2016-02-04 15:37:13

Tags: xml xpath css-selectors

Given the following HTML, is there an XPath query that will extract all tagged and untagged text between the two <h2> tags? (I am using the RSelenium package in RStudio.)

<html>
    <h2 id="section1" class="article">Heading 1</h2>
    <h3 id="section1.1" class="article">Subheading 1</h3>
    <p id="para001"  class="article section clear">
           Paragraph text 1.</p> 
    <div id="formula1" class="formula">...<img />...</div>
           Untagged text 1.
    <sub>  Subscripted text. </sub>
           Untagged text 2. 
    <em>   Emphasized text. </em>
           Untagged text 3.
    <span id="bib"> Bibliography text. </span>
           Untagged text 4.
    <p id="para002" class="article section clear">
           Paragraph text 2.</p>
    <h3 id="section1.2" class="article">Subheading 2</h3>
    <p id="para003" class="article section clear">
           Paragraph 3 text.</p>
    <h3 id="section1.3" class="article">Subheading 3</h3>
    <p id="para004" class="article section clear">
           Paragraph 4 text.</p>
    <h2 id="section2" class="article">Heading 2</h2>       
</html>

I am trying to come up with a query that will return:

Paragraph text 1.
Untagged text 1.
Subscripted text.
Untagged text 2. 
Emphasized text.
Untagged text 3.
Bibliography text.
Untagged text 4.
Paragraph text 2.
Paragraph text 3.
Paragraph text 4. 

What I have tried so far is:

//p[preceding-sibling::h2[@id='section1'] 
    and following-sibling::h2[@id='section2'] 
    and descendant::node()]

which returns:

Paragraph text 1.
Paragraph text 2.
Paragraph text 3.
Paragraph text 4.

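I tried using the solution from this question, but my problem is a bit more complicated. I also tried adding following-sibling::text()[1], but it does not extract the untagged text. If there is no good XPath solution, I would welcome an alternative approach such as CSS selectors.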

1 Answer:

Answer 0: (score: 2)

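Well, first, you don't want to filter for only p tags (that is the p that appears as the third character of your expression); you want all tags after section 1 and before section 2. Second, you are looking for all text-node descendants of the tags between those two headings.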

So: find all tags that have both preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']:

//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]

Then select all text() nodes beneath those tags:

//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]//text()
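
One caveat worth checking: `//*[...]` matches elements only, so the `//text()` step above reaches text inside the matched tags, while the untagged runs ("Untagged text 1." and so on) are text nodes that sit beside those tags as siblings. A candidate XPath that covers both cases, offered here as an assumption and not verified against RSelenium, is `//h2[@id='section1']/following-sibling::node()[following-sibling::h2[@id='section2']]/descendant-or-self::text()`. The same sibling-range idea can be sketched without XPath. The block below is a minimal illustration in Python's standard library (a swap from the R used in the question); `text_between` and `SAMPLE` are hypothetical names, the markup is a trimmed copy of the question's HTML, and it assumes the input is well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup (assumption: the real
# page can be parsed as XML; otherwise an HTML-aware parser is needed).
SAMPLE = """<html>
    <h2 id="section1" class="article">Heading 1</h2>
    <p id="para001" class="article section clear">
           Paragraph text 1.</p>
           Untagged text 1.
    <sub>  Subscripted text. </sub>
           Untagged text 2.
    <p id="para002" class="article section clear">
           Paragraph text 2.</p>
    <h2 id="section2" class="article">Heading 2</h2>
</html>"""


def text_between(parent, start_id, end_id):
    """Collect tagged and untagged text between two sibling elements."""
    pieces = []
    inside = False
    for child in parent:
        if child.get("id") == end_id:
            break  # reached the closing heading
        if inside:
            # Text inside the element, including all of its descendants.
            pieces.extend(child.itertext())
        if child.get("id") == start_id:
            inside = True  # start collecting after the opening heading
        if inside and child.tail:
            # ElementTree stores untagged text that follows an element
            # in that element's .tail attribute.
            pieces.append(child.tail)
    # Drop whitespace-only fragments and trim the rest.
    return [p.strip() for p in pieces if p.strip()]


root = ET.fromstring(SAMPLE)
for line in text_between(root, "section1", "section2"):
    print(line)
```

On this sample the loop prints the two paragraph texts, the subscripted text, and both untagged runs in document order. Elements whose text is unwanted (such as the formula div in the full page) would need an extra filter on `child.tag` or `child.get("id")`.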