Question

I am scraping some data whose hierarchy is /h2/a, where a's href should contain http://www.thedomain.com. All links look something like thedomain.com/test and so on. Right now I get only the link text, not the value of the href attribute itself.

For example:

<h2>
<a href="http://www.thedomain.com/test">Hey there</a>
<a href="http://www.thedomain.com/test1">2nd link</a>
<a href="http://www.thedomain.com/test2">3rd link</a>
</h2>

Here is my code:

html_doc.xpath('//h2/a[contains(@href, "http://www.thedomain.com")]/text()')

Hey there, 2nd link, 3rd link

Whereas I want http://www.thedomain.com/test and so on.

Answer 1

Just select @href instead of text():

//h2/a[contains(@href, "http://www.thedomain.com")]/@href

Answer 2

You can also use CSS selectors for this (which may be easier to work with than XPath in this case). You can select the <a> elements under h2 like this:

html_doc.css('h2 a')

Here is a complete working version of the code:

require 'nokogiri'

html = <<EOT
<html>
    <h2>
        <a href="http://www.thedomain.com/test">Hey there</a>
        <a href="http://www.thedomain.com/test1">2nd link</a>
        <a href="http://www.thedomain.com/test2">3rd link</a>
    </h2>
</html>
EOT

html_doc = Nokogiri::HTML(html)
# Print the href attribute of every <a> inside an <h2>
html_doc.css('h2 a').each { |link| p link['href'] }
# => "http://www.thedomain.com/test"
# => "http://www.thedomain.com/test1"
# => "http://www.thedomain.com/test2"
