How can I get only the text of an element that contains other elements, using XPath?

Time: 2016-05-07 22:50:25

Tags: html ruby xpath nokogiri

I am parsing a document with Nokogiri using XPath. I'm interested in the contents of a list structured like this:

<ul>
  <li>
    <div>
      <!-- Some data I'm not interested in -->
    </div>
    <span>
      <a href="some_url">A name I already got easily</a>
      <br>
      Some text I need to get but just can't
    </span>
  </li>
  <li>
    <div>
      <!-- Some data I'm not interested in again -->
    </div>
    <span>
      <a href="some_other_url">Another name I already got easily</a>
      <br>
      Some other text I need to get but just can't
    </span>
  </li>
  .
  .
  .
</ul>

I am using:

politicians = Array.new
rows = doc.xpath('//ul/li')
rows.each do |row|
  politician = OpenStruct.new
  politician.name = row.at_xpath('span/a/text()').to_s.strip.upcase
  politician.url = row.at_xpath('span/a/@href').to_s.strip
  politician.party = row.at_xpath('span').to_s.strip
  politicians.push(politician)
end

This works for politician.name and politician.url, but for politician.party, which is the text after the <br> tag, I can't isolate the text. Using

row.at_xpath('span').to_s.strip

gives me the entire content of the <span> tag, including the other HTML elements.

Any suggestions on how to get this text?

2 answers:

Answer 0 (score: 4)

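span/text() returns what looks like an empty result because the first text node in <span> is the whitespace (a newline and spaces) between the span's opening tag and the <a/> element. Try the following XPath instead: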

span/text()[normalize-space()]

This XPath should return the non-empty text nodes that are direct children of <span>.

Answer 1 (score: 1)

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span>
  <a href="some_other_url">Another name I already got easily</a>
  <br>
  Some other text I need to get but just can't
</span>
EOT

doc.at('span br').next.text # => "\n  Some other text I need to get but just can't\n"

doc.at('//span/br').next.text # => "\n  Some other text I need to get but just can't\n"

Cleaning up the resulting string is easy:

"\n  Some other text I need to get but just can't\n".strip # => "Some other text I need to get but just can't"

The problem with your code is that you didn't drill far enough into the DOM to reach what you want, and you called the wrong method:

doc.at_xpath('//span').to_s # => "<span>\n  <a href=\"some_other_url\">Another name I already got easily</a>\n  <br>\n  Some other text I need to get but just can't\n</span>"

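to_s is the same as to_html and returns the node as its original markup. Using text strips the markup, which gets you closer, but, again, you're standing too far back: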

doc.at_xpath('//span').text # => "\n  Another name I already got easily\n  \n  Some other text I need to get but just can't\n"

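Because <br> is not a container, you can't get text from it, but you can still use it to navigate: get its next node, which is a Text node, and retrieve that node's text: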

doc.at('span br').next.class # => Nokogiri::XML::Text

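When parsing XML/HTML, it's important to point at the actual node you want and then use the appropriate method on it. Failing to do so forces you to dig around trying to get at the actual data you want.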

Putting it all together, I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span>
  <a href="some_other_url">Another name I already got easily</a>
  <br>
  Some other text I need to get but just can't
</span>
EOT

data = doc.search('span').map{ |span|
  name = span.at('a').text
  url = span.at('a')['href']
  party = span.at('br').next.text.strip

  {
    name: name,
    url: url,
    party: party
  }
}
# => [{:name=>"Another name I already got easily", :url=>"some_other_url", :party=>"Some other text I need to get but just can't"}]

You can fold, spindle, or mutilate that to bend it to your will.

Finally, don't do search('//path/to/some/node/text()').text. You're wasting keystrokes and CPU:

doc = Nokogiri::HTML(<<EOT)
<p>
  Some other text I need to get but just can't
</p>
EOT

doc.at('//p')        # => #<Nokogiri::XML::Element:0x3fed0841edf0 name="p" children=[#<Nokogiri::XML::Text:0x3fed0841e918 "\n  Some other text I need to get but just can't\n">]>
doc.at('//p/text()') # => #<Nokogiri::XML::Text:0x3fed0841e918 "\n  Some other text I need to get but just can't\n">

text() returns a text node, but it doesn't return the text itself.

As a result you're forced to do:

doc.at('//p/text()').text # => "\n  Some other text I need to get but just can't\n"

Instead, point at the thing you want and tell Nokogiri to get it:

doc.at('//p').text  # => "\n  Some other text I need to get but just can't\n"

XPath can point at the node, but that doesn't help when what we want is the text, so simplify the selector.