Using ShingleFilter, but taking punctuation into account

Time: 2014-03-17 10:37:43

Tags: lucene

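I want to build shingles from my words, but I don't want any shingle to span punctuation such as a comma or a period. How can I achieve this?

My current chain is: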

TokenStream tokenStream = new StandardTokenizer(LUCENE_VERSION, new StringReader(input));
tokenStream =  new StandardFilter( LUCENE_VERSION, tokenStream );
tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter( tokenStream, 2 );

When I process the following sentence:

A test sentence, great thing. Considering punctuation would be great, too.

the result is (single words omitted here):

test sentence; sentence great; great thing; thing considering; considering punctuation;

But I would like the following result (single words omitted here):

test sentence; great thing; considering punctuation;

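1 answer:

Answer 0: (score: 0)

I found a possible solution myself, although I am fairly sure Lucene offers another (better) way. My solution is to split the string on punctuation before feeding it to Lucene, so that each segment gets its own filter chain and no shingle can cross a segment boundary.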

// Split on every punctuation character; each segment is analyzed separately.
for (String part : input.split("\\p{Punct}")) {
    TokenStream tokenStream = new StandardTokenizer(LUCENE_VERSION, new StringReader(part));
    tokenStream = new StandardFilter(LUCENE_VERSION, tokenStream);
    tokenStream = new LowerCaseFilter(LUCENE_VERSION, tokenStream);
    tokenStream = new StopFilter(LUCENE_VERSION, tokenStream, EnglishAnalyzer.getDefaultStopSet());
    tokenStream = new ShingleFilter(tokenStream, 2);
    // Call tokenStream.reset() before reading tokens with incrementToken(),
    // then end() and close() when done.
    // do something with tokenStream...
}
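
To make the effect of splitting first easier to see, here is a minimal plain-Java sketch of the same idea without Lucene. The class name `PunctuationAwareShingles`, the method `bigrams`, and the tiny stop list are made up for this illustration; they are not Lucene API, and Lucene's default English stop set is larger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PunctuationAwareShingles {

    // Hypothetical, deliberately tiny stop list (assumption, not Lucene's set).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "be", "is", "to", "of"));

    /** Returns two-word shingles that never cross a punctuation character. */
    static List<String> bigrams(String input) {
        List<String> shingles = new ArrayList<>();
        // Same split as in the answer: every punctuation character ends a segment.
        for (String part : input.split("\\p{Punct}")) {
            List<String> words = new ArrayList<>();
            for (String word : part.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                    words.add(word);
                }
            }
            // Pair each word with its successor inside the current segment only.
            for (int i = 0; i + 1 < words.size(); i++) {
                shingles.add(words.get(i) + " " + words.get(i + 1));
            }
        }
        return shingles;
    }

    public static void main(String[] args) {
        String input = "A test sentence, great thing. "
                + "Considering punctuation would be great, too.";
        for (String shingle : bigrams(input)) {
            System.out.println(shingle);
        }
    }
}
```

For the example sentence this yields "test sentence", "great thing" and "considering punctuation", and never "sentence great" or "thing considering". It also yields pairs from the rest of the third segment (for instance "punctuation would"), since "would" is not in the stop list used here. Note that `\p{Punct}` also matches apostrophes and hyphens, so a word such as "don't" would be cut in two; a narrower pattern like `[,.;:!?]` may be a better fit depending on the input.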

If you find another solution, please let me know.