Question

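I am trying to extract text from part of a website. The div node that contains the text also contains several child nodes, each with its own text or other content. However, I only want the text from the top-level node, not the text from its children!

Here is what the relevant part of the web page looks like: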

    <div class="body-text">
       <div id="other" class="other"></div>
       <div id="other2" class="other2"></div>
       <div id="other3" class="other3"> 
           <span>irrelevant text</span>
        </div>

       <h2>heading2</h2>

       -Text which I want to get. There are also text parts which are linked.

    </div>

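Here is my code, which gives me the "messy" text. I tried /text(), but that truncates my text as soon as part of it is linked, so I can't use it. I also tried /div/node()[not(self::div)], but did not manage to get it working. Can anyone help?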

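library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse(), xpathSApply(), xmlValue()

webpage <- getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding = 'UTF-8')

# xmlValue() concatenates the text of the div and of all its descendants
body <- xpathSApply(pagetree, "//div[@class='body-text']", xmlValue)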

Answer 1

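1) Example from the post

Try searching for nodes that are text() or a/text() within the body-text div, removing any trivial nodes that contain only whitespace: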

## input

Text <- '<div class="body-text">
       <div id="other" class="other"></div>
       <div id="other2" class="other2"></div>
       <div id="other3" class="other3"> 
           <span>irrelevant text</span>
        </div>
       <h2>heading2</h2>
       -Text which I want to get. There are also text parts which are linked.
    </div>'

library(XML)
pagetree <- htmlTreeParse(Text, asText = TRUE, useInternalNodes = TRUE)

## process it - xpth is the Xpath expression and xpathSApply() runs it

trim <- function(x) gsub("^\\s+|\\s+$", "", x) # trim whitespace from start & end

xpth <- "( //div[@class='body-text']/text() | 
   //div[@class='body-text']/a/text() ) [ normalize-space() != '' ]"
txt <- trim(xpathSApply(pagetree, xpth, xmlValue))

The result is:

> txt
[1] "-Text which I want to get. There are also text parts which are linked."

2) Example provided by the poster in the comments. Use this as Text:

Text <- '<div class="body-text"> text starts here 
 <a class="footnote" href="link"> text continues here <sup>1</sup> </a> 
 and continues here</div>'

and repeat the code above:

> txt
[1] "text starts here"    "text continues here" "and continues here"
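
If a single string is wanted rather than three pieces, the result can be joined afterwards. The following is a minimal sketch that reuses the code above on the second example; the variable name `sentence` and the choice of a single space as separator are assumptions made for illustration, not part of the original answer.

```r
library(XML)

# Second example from the answer (as supplied by the poster in the comments)
Text <- '<div class="body-text"> text starts here 
 <a class="footnote" href="link"> text continues here <sup>1</sup> </a> 
 and continues here</div>'
pagetree <- htmlTreeParse(Text, asText = TRUE, useInternalNodes = TRUE)

trim <- function(x) gsub("^\\s+|\\s+$", "", x) # trim whitespace from start & end

xpth <- "( //div[@class='body-text']/text() |
   //div[@class='body-text']/a/text() ) [ normalize-space() != '' ]"
txt <- trim(xpathSApply(pagetree, xpth, xmlValue))

# Assumption: the pieces should be separated by exactly one space
sentence <- paste(txt, collapse = " ")
print(sentence)
```

Note that the footnote number inside `<sup>` is not part of the result, because the expression only looks at text nodes directly under the div and directly under its `a` children.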

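EDIT: Revised the above based on the poster's comments. The main changes are the XPath expression xpth and the final point, which illustrates the same code with the example the poster provided in the comments.

EDIT: Moved the filtering of whitespace-only nodes out of R and into XPath. This lengthens the XPath expression slightly but eliminates the R Filter() step. Also simplified and slightly shortened the presentation.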

Answer 2

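There are a few possible solutions to this problem, but first it is necessary to clarify which nodes you want to select. You say:

I only want the text from the top-level node, not the text from its children!

But this isn't actually the case! All of the element nodes in the article text (e.g. a, em, etc.) are themselves children of the body-text div. What you really want to do is select all of the text found in a certain section of that div. Conveniently, your source document (linked in the comments above) contains comment nodes that mark the start and end of the article. They look like this: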

<!-- inizio TESTO -->article text<!-- fine TESTO -->

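In fact, you really only need the start marker, since there is no other content after it.

Selecting text after the start marker

The following expression selects the desired nodes: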

//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()

Tested against the following reduced document:

<div class="body-text">
    <div class="fb-like-button" id="fb-like-head"></div>
    <h2><!-- inizio OCCHIELLO -->IRAN<!-- fine OCCHIELLO --></h2>
    <h1><!-- title -->"A Isfahan colpito sito nucleare"<br/>Londra annuncia azioni dure<!-- fine TITOLO --></h1>
    <h3><!-- summary -->Secondo il<em>Times</em>, fonti di intelligence...<br/><strong><br/></strong><!-- fine SOMMARIO --></h3>
    <div class="sidebar">Sidebar text...</div>
    <!-- inizio TESTO --><strong>TEHERAN</strong> - L'esplosione avvenuta 
    <a href="http://www.repubblica.it" class="footnote">lunedì scorso in Iran a Isfahan <sup>1</sup></a> avrebbe colpito un 
    sito nucleare. Lo hanno riferito fonti dell'intelligence israeliana al quotidiano britannico <em>The Times</em>, secondo le 
    quali alcune immagini satellitari "mostrano chiaramente colonne di fumo e la distruzione" di una struttura nucleare di Isfahan. 
    Sale, intanto, la tensione con la Gran Bretagna: dopo <a href="http://www.repubblica.it" class="footnote">l'assalto all'
    ambasciata britannica <sup>2</sup></a> ieri...<!-- fine TESTO -->
</div>

it returns the following text nodes:

[#text: TEHERAN]
[#text:  - L'esplosione avvenuta 
    ]
[#text: lunedì scorso in Iran a Isfahan ]
[#text: 1]
[#text:  avrebbe colpito un 
    sito nucleare. Lo hanno riferito fonti dell'intelligence israeliana al quotidiano britannico ]
[#text: The Times]
[#text: , secondo le 
    quali alcune immagini satellitari "mostrano chiaramente colonne di fumo e la distruzione" di una struttura nucleare di Isfahan. 
    Sale, intanto, la tensione con la Gran Bretagna: dopo ]
[#text: l'assalto all'
    ambasciata britannica ]
[#text: 2]
[#text:  ieri...]
[#text: 
]

This is a node-set that you can iterate over, etc. I don't know R, so I can't provide those details.
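
Since the question is about R, here is a minimal sketch of how that expression might be run with the XML package. The HTML string is a shortened, made-up stand-in for the article markup (only the marker comments are taken from the example above), and the names `doc`, `xp`, `pieces` and `article` are illustrative.

```r
library(XML)

# Assumption: a shortened stand-in for the real page, with the same marker comments
html <- '<div class="body-text">
  <h2><!-- inizio OCCHIELLO -->IRAN<!-- fine OCCHIELLO --></h2>
  <div class="sidebar">Sidebar text...</div>
  <!-- inizio TESTO --><strong>TEHERAN</strong> - first part
  <a href="#" class="footnote">linked part <sup>1</sup></a> last part<!-- fine TESTO -->
</div>'

doc <- htmlParse(html, asText = TRUE)

# Every text node that follows the start marker in document order
xp <- "//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()"
pieces <- xpathSApply(doc, xp, xmlValue)

# Join the pieces, then collapse runs of whitespace and trim the ends
article <- gsub("\\s+", " ", paste(pieces, collapse = ""))
article <- gsub("^\\s+|\\s+$", "", article)
print(article)
```

Two things are worth keeping in mind with this approach: the footnote numbers inside `<sup>` are included in the result, and the `following::` axis is not limited to the div, so any text that comes later in the document would be picked up as well.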

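Selecting text between the start and end markers

If there could be content after the end marker that should be excluded (there isn't any in the provided example), then use the following expression: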

//div[@class='body-text']//text()[preceding::comment()[.=' inizio TESTO '] and
                                  following::comment()[.=' fine TESTO ']]
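
A minimal R sketch of this expression follows. The sample HTML is invented for illustration and deliberately places a paragraph after the end marker so that the exclusion is visible; `doc`, `xp` and `between` are hypothetical names.

```r
library(XML)

# Assumption: made-up sample with content after the end marker
html <- '<div class="body-text">
  <!-- inizio TESTO -->wanted <em>text</em> here<!-- fine TESTO -->
  <p>trailing content to exclude</p>
</div>'
doc <- htmlParse(html, asText = TRUE)

# Text nodes that come after the start marker and before the end marker
xp <- paste0(
  "//div[@class='body-text']//text()",
  "[preceding::comment()[.=' inizio TESTO '] and ",
  "following::comment()[.=' fine TESTO ']]"
)
between <- paste(xpathSApply(doc, xp, xmlValue), collapse = "")
print(between)
```

The text inside `<em>` is kept because `//text()` descends into child elements, while the paragraph after the end marker is dropped because no end marker follows it.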

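Selecting text between the start and end markers (Kayessian formula)

Note that the previous expression can be expressed more directly as the intersection of two node-sets: 1) all text nodes that follow the start marker; 2) all text nodes that precede the end marker. There is a general formula for performing an intersection in XPath 1.0: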

$set1[count(.|$set2)=count($set2)]

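The general idea, expressed in plain English, is that if you add a node from $set1 to $set2 and the size of $set2 does not change, then that node must already have been in $set2. The set of all nodes from $set1 for which this holds is the intersection of $set1 and $set2.

In your specific case: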
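
The same counting argument can be illustrated outside XPath. The sketch below uses plain R character vectors as stand-ins for node-sets (an assumption made purely for illustration; `set1`, `set2` and `in_both` are not real node-sets) and checks the result against R's built-in `intersect()`.

```r
# Stand-ins for two node-sets; each vector has no repeated elements, like a set
set1 <- c("a", "b", "c", "d")
set2 <- c("c", "d", "e")

# Keep x from set1 only when adding it to set2 leaves the size of set2 unchanged
in_both <- Filter(function(x) length(union(x, set2)) == length(set2), set1)
print(in_both)

# The counting trick agrees with the built-in set intersection
stopifnot(identical(in_both, intersect(set1, set2)))
```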

$set1 = //div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()
$set2 = //div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text()

Putting it all together:

//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()[
   count(.|//div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text())
     =
   count(//div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text())]
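
Writing the combined expression by hand is error-prone, so in R it may be easier to assemble it from the two parts. The sketch below does that with a hypothetical helper, `kayessian()`, which is not part of the XML package; the sample HTML is invented for illustration.

```r
library(XML)

# Hypothetical helper: builds $set1[count(.|$set2)=count($set2)] as a string
kayessian <- function(set1, set2) {
  sprintf("%s[count(.|%s)=count(%s)]", set1, set2, set2)
}

# Assumption: made-up sample with content after the end marker
html <- '<div class="body-text">
  <!-- inizio TESTO -->wanted <em>text</em> here<!-- fine TESTO -->
  <p>trailing content to exclude</p>
</div>'
doc <- htmlParse(html, asText = TRUE)

set1 <- "//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()"
set2 <- "//div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text()"

# Text nodes that are both after the start marker and before the end marker
both <- paste(xpathSApply(doc, kayessian(set1, set2), xmlValue), collapse = "")
print(both)
```

On this sample the result matches what the predicate-based expression from the previous section selects.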

R xpathSApply: get a node's text without the text of its child nodes
