Jsoup: find the closest href

Time: 2014-03-21 14:00:18

Tags: java html html-parsing jsoup href

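I have a map of strings. What I currently do is get the page body, split it into words with jsoup.getPageBody().split("[^a-zA-Z]+"), then iterate over the words and check whether any of them is in my string map, like this: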

for (String word : jsoup.getPageBody().split("[^a-zA-Z]+")) {
    if (wordIsInMap(word.toLowerCase())) {
        // At this point the word is in the string map
    }
}

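While inside the loop, I would like to get the closest hyperlink (href). Distance is measured by the number of words. I couldn't find an example like this on the jsoup documentation pages. How can I do this?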

An example for this page: http://en.wikipedia.org/wiki/2012_in_American_television

If the string map contains race and crucial, then I would like to get:

http://en.wikipedia.org/wiki/Breeders%27_Cup_Classic

http://en.wikipedia.org/wiki/Fox_Broadcasting_Company

these two links.

1 answer:

Answer 0: (score: 2)

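Here is a very simple implementation to get you started. Note that it does not find the closest link by number of words: it only looks at sibling nodes, trying the ones after the matching text first and then the ones before it. I'll leave it to you to modify.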

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.List;

public class Program {

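public static void main(String... args) throws Exception {
    String searchFor = "online and";

    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/2012_in_American_television").get();
    // First element whose own text contains the phrase; null when nothing matches
    Element element = doc.getElementsContainingOwnText(searchFor).first();
    if (element == null) {
        System.out.println("No element contains '" + searchFor + "'");
        return;
    }

    // Can be null: the helper checks one text node at a time and is case-sensitive
    Node nodeWithText = getFirstNodeContainingText(element.childNodes(), searchFor);
    Element closestLink = (nodeWithText == null) ? null : getClosestLink(nodeWithText);

    System.out.println("Link closest to '" + searchFor + "': "
            + (closestLink == null ? "none found" : closestLink.attr("abs:href")));
}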

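// Returns the first link found in the node or its following siblings,
// falling back to the preceding siblings; null if no sibling holds a link.
private static Element getClosestLink(Node node) {
    if (node == null) {
        return null;
    }

    // Check the node itself, then walk forward through the following siblings
    for (Node next = node; next != null; next = next.nextSibling()) {
        Element linkElem = firstLinkIn(next);
        if (linkElem != null) {
            return linkElem;
        }
    }

    // Nothing at or after the node: walk backward through the preceding siblings
    for (Node prev = node.previousSibling(); prev != null; prev = prev.previousSibling()) {
        Element linkElem = firstLinkIn(prev);
        if (linkElem != null) {
            return linkElem;
        }
    }

    return null;
}

// First <a> element at or below the node, or null (text nodes never hold links)
private static Element firstLinkIn(Node node) {
    if (node instanceof Element) {
        return ((Element) node).getElementsByTag("a").first();
    }
    return null;
}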

private static Node getFirstNodeContainingText(List<Node> nodes, String text) {
    for (Node node : nodes) {
        if (node instanceof TextNode) {
            String nodeText = ((TextNode) node).getWholeText();
            if (nodeText.contains(text)) {
                return node;
            }
        }
    }
    return null;
}

}
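
The code above measures closeness in sibling nodes, not in words. As a rough sketch of the word-count distance the question asks for, the following assumes the page has already been flattened into a list of words in document order, each carrying the href of its enclosing link or null (for instance by walking the parsed document's nodes, which is not shown here). The names NearestLinkByWords, Token, closestHref and sampleTokens are hypothetical and not part of jsoup, and the sample sentence is made up for illustration rather than taken from the real page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NearestLinkByWords {

    // One word of page text. href is non-null only for words inside an <a> element.
    static final class Token {
        final String word;
        final String href;

        Token(String word, String href) {
            this.word = word;
            this.href = href;
        }
    }

    // Returns the href of the linked word closest to tokens.get(index), counted in words.
    // When a link before and a link after are equally far away, the one after wins.
    // Returns null if no word in the list is linked.
    static String closestHref(List<Token> tokens, int index) {
        for (int distance = 0; distance < tokens.size(); distance++) {
            int after = index + distance;
            if (after < tokens.size() && tokens.get(after).href != null) {
                return tokens.get(after).href;
            }
            int before = index - distance;
            if (before >= 0 && tokens.get(before).href != null) {
                return tokens.get(before).href;
            }
        }
        return null;
    }

    // Made-up sample data, not the real page content.
    static List<Token> sampleTokens() {
        String classic = "http://en.wikipedia.org/wiki/Breeders%27_Cup_Classic";
        String fox = "http://en.wikipedia.org/wiki/Fox_Broadcasting_Company";
        List<Token> tokens = new ArrayList<Token>();
        tokens.add(new Token("Breeders", classic));
        tokens.add(new Token("Cup", classic));
        tokens.add(new Token("race", null));
        tokens.add(new Token("proved", null));
        tokens.add(new Token("crucial", null));
        tokens.add(new Token("for", null));
        tokens.add(new Token("Fox", fox));
        return tokens;
    }

    public static void main(String[] args) {
        // Stand-in for the string map from the question
        Set<String> wanted = new HashSet<String>(Arrays.asList("race", "crucial"));
        List<Token> tokens = sampleTokens();
        for (int i = 0; i < tokens.size(); i++) {
            String word = tokens.get(i).word;
            if (wanted.contains(word.toLowerCase())) {
                System.out.println(word + " -> " + closestHref(tokens, i));
            }
        }
    }
}
```

Searching outward one word at a time keeps the lookup simple and stops at the first linked word on either side. For very long pages it may be worth precomputing the nearest link index for every position in two linear passes instead.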