Lucene phrase query not working

Asked: 2011-09-14 10:22:40

Tags: lucene

I am trying to write a simple program with Lucene 2.9.4 that runs a phrase query, but I get 0 hits.

public class HelloLucene {

public static void main(String[] args) throws IOException, ParseException{
    // TODO Auto-generated method stub

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
    Directory index = new RAMDirectory();

    IndexWriter w = new IndexWriter(index,analyzer,true,IndexWriter.MaxFieldLength.UNLIMITED);
    addDoc(w, "Lucene in Action");
    addDoc(w, "Lucene for Dummies");
    addDoc(w, "Managing Gigabytes");
    addDoc(w, "The Art of Computer Science");
    w.close();      

    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("content", "lucene"),0);
    pq.add(new Term("content", "in"),1);
    pq.setSlop(0);

    int hitsPerPage = 10;
    IndexSearcher searcher = new IndexSearcher(index,true);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(pq, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for(int i=0; i<hits.length; i++){
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i+1)+ "." + d.get("content"));
    }

    searcher.close();


}

public static void addDoc(IndexWriter w, String value)throws IOException{
    Document doc = new Document();
    doc.add(new Field("content", value, Field.Store.YES, Field.Index.NOT_ANALYZED));
    w.addDocument(doc);
}

}

Please tell me what is wrong. I have also tried using QueryParser, as follows:

String querystr ="\"Lucene in Action\"";

    Query q = new QueryParser(Version.LUCENE_29, "content",analyzer).parse(querystr);

But that does not work either.

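3 answers:

Answer 0 (score: 4)

There are two issues with the code (neither has anything to do with your version of Lucene):

1) StandardAnalyzer does not index stop words (such as "in"), so the PhraseQuery can never find the phrase "Lucene in".

2) As mentioned by Xodarap and Shashikant Kore, the call that creates the document needs to use Index.ANALYZED; otherwise Lucene does not run the Analyzer on that part of the document. There may be a neat way to do this with Index.NOT_ANALYZED, but I am not familiar with it.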
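The two issues above can be illustrated with a small toy model in plain Java. This is not Lucene code: the class AnalysisSketch, its methods, and the four-word stop list are hypothetical names made up for illustration, and the real StandardAnalyzer tokenizes more carefully and uses a longer stop list. The sketch only shows why neither the raw field value nor the analyzed tokens contain a separate term "in" next to "lucene".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: NOT Lucene code, just an illustration of the idea.
class AnalysisSketch {
    // Hypothetical, shortened stop list (assumption for this sketch).
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("in", "for", "the", "of"));

    // Roughly the effect of NOT_ANALYZED: the whole value is one single term.
    static List<String> notAnalyzed(String value) {
        List<String> terms = new ArrayList<String>();
        terms.add(value);
        return terms;
    }

    // Roughly the effect of an analyzer with a stop filter:
    // lowercase, split on whitespace, drop stop words.
    static List<String> analyzed(String value) {
        List<String> terms = new ArrayList<String>();
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // One term, so the separate terms "lucene" and "in" never match it.
        System.out.println(notAnalyzed("Lucene in Action")); // [Lucene in Action]
        // "in" is gone, so a phrase containing "in" cannot match either.
        System.out.println(analyzed("Lucene in Action"));    // [lucene, action]
    }
}
```

In this model, the original program fails for the first reason (a single unanalyzed term), and it would still fail for the second reason after switching to Index.ANALYZED while keeping "in" in the phrase.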

For an easy fix, change your addDoc method to:

public static void addDoc(IndexWriter w, String value)throws IOException{
    Document doc = new Document();
    doc.add(new Field("content", value, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}

And change the PhraseQuery you create to:

    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("content", "computer"),0);
    pq.add(new Term("content", "science"),1);
    pq.setSlop(0);

This gives you the following result, because neither "computer" nor "science" is a stop word:

    Found 1 hits.
    1.The Art of Computer Science

If you want to find "Lucene in Action", you can increase the slop of this PhraseQuery (which widens the allowed "gap" between the two words):

    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("content", "lucene"),0);
    pq.add(new Term("content", "action"),1);
    pq.setSlop(1);
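To see why a slop of 1 helps here, the following toy model in plain Java may be useful. It is not Lucene's implementation: SlopSketch and its methods are hypothetical, the stop list is shortened, and only in-order gaps between two terms are handled (Lucene's sloppy matching is more general). It assumes that a removed stop word still consumes a position, so "action" ends up two positions after "lucene".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: NOT Lucene code, a simplified two-term phrase matcher.
class SlopSketch {
    // Hypothetical, shortened stop list (assumption for this sketch).
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("in", "for", "the", "of"));

    // Maps each kept token to its positions. Stop words are dropped,
    // but they still consume a position (assumption stated above).
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> index = new HashMap<String, List<Integer>>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            String token = tokens[pos];
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            List<Integer> list = index.get(token);
            if (list == null) {
                list = new ArrayList<Integer>();
                index.put(token, list);
            }
            list.add(pos);
        }
        return index;
    }

    // True if 'second' follows 'first' with at most 'slop' extra positions between them.
    static boolean matches(String text, String first, String second, int slop) {
        Map<String, List<Integer>> index = positions(text);
        List<Integer> a = index.containsKey(first)
                ? index.get(first) : Collections.<Integer>emptyList();
        List<Integer> b = index.containsKey(second)
                ? index.get(second) : Collections.<Integer>emptyList();
        for (int pa : a) {
            for (int pb : b) {
                int extraGap = (pb - pa) - 1; // 0 means directly adjacent
                if (extraGap >= 0 && extraGap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("Lucene in Action", "lucene", "action", 0)); // false
        System.out.println(matches("Lucene in Action", "lucene", "action", 1)); // true
        System.out.println(matches("The Art of Computer Science", "computer", "science", 0)); // true
    }
}
```

In this model the dropped "in" leaves a hole at position 1, which is exactly the one-position gap that a slop of 1 tolerates.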

If you really want to search for the phrase "lucene in", you need to choose a different analyzer (such as SimpleAnalyzer). In Lucene 2.9, just replace your call to StandardAnalyzer with:

    SimpleAnalyzer analyzer = new SimpleAnalyzer();

Or, if you are using version 3.1 or later, you need to add the version information:

    SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_35);

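Here is a helpful post on a similar issue (it will help you get started with PhraseQuery): Exact Phrase search using Lucene? - see WhiteFang34's answer.

Answer 1 (score: 1)

The field needs to be analyzed, and term vectors need to be enabled.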

doc.add(new Field("content", value, Field.Store.YES, Field.Index.ANALYZED,  Field.TermVector.YES));

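If you do not intend to retrieve the field from the index, you can disable storing it.

Answer 2 (score: 0)

Here is my solution using Lucene Version.LUCENE_35, which is also known as Lucene 3.5.0 from http://lucene.apache.org/java/docs/releases.html. If you are using an IDE such as Eclipse, you can add the .jar file to the build path. Here is a direct link to the 3.5.0 .jar file: http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/3.5.0/lucene-core-3.5.0.jar

When newer versions of Lucene come out, this solution will still apply as long as you keep using the 3.5.0 .jar.

Now the code: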

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Index {
public static void main(String[] args) throws IOException, ParseException {
    // To store the Lucene index in RAM
    Directory directory = new RAMDirectory();
    // To store the Lucene index on your hard disk, you can use
    // (needs java.io.File and org.apache.lucene.store.FSDirectory):
    //Directory directory = FSDirectory.open(new File("/foo/bar/testindex"));

    // Set the analyzer that you want to use for the task.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
    // Creating Lucene Index; note, the new version demands configurations.
    IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_35, analyzer);  
    IndexWriter writer = new IndexWriter(directory, config);
    // Note: There are other ways of initializing the IndexWriter.
    // (see http://lucene.apache.org/java/3_5_0/api/all/org/apache/lucene/index/IndexWriter.html)

    // The new version of Document.add in Lucene requires a Field argument,
    //  and there are a few ways of calling the Field constructor.
    //  (see http://lucene.apache.org/java/3_5_0/api/core/org/apache/lucene/document/Field.html)
    // Here I just use one of the Field constructors that takes a String value.
    List<Document> docs = new ArrayList<Document>();
    Document doc1 = new Document();
    doc1.add(new Field("content", "Lucene in Action", 
        Field.Store.YES, Field.Index.ANALYZED));
    Document doc2 = new Document();
    doc2.add(new Field("content", "Lucene for Dummies", 
        Field.Store.YES, Field.Index.ANALYZED));
    Document doc3 = new Document();
    doc3.add(new Field("content", "Managing Gigabytes", 
        Field.Store.YES, Field.Index.ANALYZED));
    Document doc4 = new Document();
    doc4.add(new Field("content", "The Art of Lucene", 
        Field.Store.YES, Field.Index.ANALYZED));

    docs.add(doc1); docs.add(doc2); docs.add(doc3); docs.add(doc4);

    writer.addDocuments(docs);
    writer.close();

    // To enable query/search, we need to initialize 
    //  the IndexReader and IndexSearcher.
    // Note: The IndexSearcher in Lucene 3.5.0 takes an IndexReader parameter
    //  instead of a Directory parameter.
    IndexReader iRead = IndexReader.open(directory);
    IndexSearcher iSearch = new IndexSearcher(iRead);

    // Parse a simple query. StandardAnalyzer drops the stop word "in",
    //  so this effectively searches for the word "lucene".
    // Note: you need to specify the fieldname for the query 
    // (in our case it is "content").
    QueryParser parser = new QueryParser(Version.LUCENE_35, "content", analyzer);
    Query query = parser.parse("lucene in");

    // Search the Index with the Query, with max 1000 results
    ScoreDoc[] hits = iSearch.search(query, 1000).scoreDocs;

    // Iterate through the search results
    for (int i=0; i<hits.length;i++) {
        // From the indexSearch, we retrieve the search result individually
        Document hitDoc = iSearch.doc(hits[i].doc);
        // Specify the Field type of the retrieved document that you want to print.
        // In our case we only have 1 Field i.e. "content".
        System.out.println(hitDoc.get("content"));
    }
    iSearch.close(); iRead.close(); directory.close();
}   
}