Question

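In an indexing filter, is there a way to find the anchor text of the link that led to the current URL/document? I tried inlinks, but it appears to be empty.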

public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum,
        Inlinks inlinks) throws IndexingException {

    // Need to know, at this point, the anchor text of the link that the current document was reached from

    return doc;
}

If the current URL is http://foo.com/pagex, there must be a link to pagex somewhere on http://foo.com. I need to know the anchor text of that link.

Answer 1

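The anchor text is available in the inlinks, but for them to be populated, db.ignore.internal.links and linkdb.ignore.external.links must be set to false in nutch-default.xml. Alternatively, they can be overridden in nutch-site.xml.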
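
Once the inlinks are populated, picking out the anchor text for one particular source page is a simple filter over them. The following is only a minimal sketch: Link and AnchorLookup are hypothetical stand-ins so the code compiles without Nutch on the classpath. In a real indexing filter you would presumably loop over inlinks.iterator() and call getFromUrl() and getAnchor() on each Inlink instead, and check for a null inlinks argument first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.apache.nutch.crawl.Inlink (assumption, not Nutch code).
final class Link {
    private final String fromUrl;
    private final String anchor;

    Link(String fromUrl, String anchor) {
        this.fromUrl = fromUrl;
        this.anchor = anchor;
    }

    String getFromUrl() { return fromUrl; }
    String getAnchor() { return anchor; }
}

// Hypothetical helper, named for illustration only.
final class AnchorLookup {
    private AnchorLookup() {}

    // Returns the non-empty anchor texts of all links whose source page is fromUrl.
    // A null list is treated as "no inlinks recorded" and yields an empty result.
    static List<String> anchorsFrom(List<Link> inlinks, String fromUrl) {
        List<String> anchors = new ArrayList<>();
        if (inlinks == null) {
            return anchors;
        }
        for (Link link : inlinks) {
            String anchor = link.getAnchor();
            if (fromUrl.equals(link.getFromUrl()) && anchor != null && !anchor.isEmpty()) {
                anchors.add(anchor);
            }
        }
        return anchors;
    }
}
```

For the example in the question, calling anchorsFrom with the links pointing at http://foo.com/pagex and the source URL of the foo.com page would return the anchor text(s) used there. Note that the source URL has to match the stored form exactly, so a trailing slash or other normalization difference would cause a miss.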

Nutch: anchor text for the current URL