Stemming words and creating an index without stop words using Lucene 4.0

Date: 2013-01-04 08:45:08

Tags: lucene stemming stop-words

I have the following problem: I have several text documents that I need to parse and index, but without stop words and with stemmed terms. I could do this manually, but I heard from a colleague that Lucene can do it automatically. I searched the web and found many examples, which I tried, but each one used a different version of Lucene and a different approach, and none of them was complete. At the end of this process I need to calculate the tf/idf of each term in my collection.

Update: I have now created an index with one document. The document has its stop words removed and its terms stemmed. How do I calculate the tf/idf of this doc using Lucene? (I will add more documents once I figure out how to do the calculation.)

Any help with Lucene would be greatly appreciated. Thanks.

import java.io.*;
import java.util.HashSet;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.*;
import org.apache.lucene.analysis.snowball.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;


public class Stemmer
{
    static HashSet<String> stopWordsList = null;

    public static String Stem(String text, String language) throws IOException
    {
        // "parse" is the asker's helper class (not shown) that reads the stop-word file
        parse p = new parse();
        stopWordsList = p.readStopWordsFile();
        StringBuffer result = new StringBuffer();
        if (text!=null && text.trim().length()>0)
        {
            StringReader tReader = new StringReader(text);
            // Analyzer analyzer = new StopAnalyzer(Version.LUCENE_36,stopWordsList);
            @SuppressWarnings("deprecation")
            Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_35,"English",stopWordsList);
            // disk index storage
            Directory directory = FSDirectory.open(new File("d:/index")); 

            @SuppressWarnings("deprecation")
            IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));

            TokenStream tStream = analyzer.tokenStream("contents", tReader);
            @SuppressWarnings("deprecation")
            TermAttribute term = tStream.addAttribute(TermAttribute.class);

            try {
                // TokenStream contract: reset() must be called before the first incrementToken()
                tStream.reset();
                while (tStream.incrementToken())
                    {
                        result.append(term.term());
                        result.append(" ");
                    }
                tStream.end();
                tStream.close();

                Document doc = new Document();
                String title = "DocID";
                // adding title field
                doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED)); 
                String content = result.toString();
                // adding content field
                doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
                // writing new document to the index
                writer.addDocument(doc);  
                writer.close();
                System.out.println("Reult is: " + result);  
            }
            catch (IOException ioe) {
                System.out.println("Error: " + ioe.getMessage());
            }
        }

        // If, for some reason, the stemming did not happen, return the original text
        if (result.length()==0)
            result.append(text);
        return result.toString().trim();

    } //end stem

    public static void main (String[] args) throws IOException
        {
            Stemmer.Stem("Michele Bachmann amenities pressed her allegations that the former head of her Iowa presidential bid was bribed by the campaign of rival Ron Paul to endorse him, even as one of her own aides denied the charge.", "English");
        }
}//end class    

1 Answer:

Answer 0 (score: 0)

To filter out stop words, use StopAnalyzer. It will remove the following words:

  "a", "an", "and", "are", "as", "at", "be", "but", "by",
  "for", "if", "in", "into", "is", "it",
  "no", "not", "of", "on", "or", "such",
  "that", "the", "their", "then", "there", "these",
  "they", "this", "to", "was", "will", "with"
You can supply an Analyzer both when indexing, via the addDocument(Iterable<? extends IndexableField> doc, Analyzer analyzer) method, and at search time. See the Javadoc for more options and details.
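As a hedged sketch of that overload against the Lucene 4.0 API (names here are illustrative, and in 4.0 the analyzers live in the separate lucene-analyzers-common module): the default analyzer is set on the IndexWriterConfig, and a different one can be passed per document, e.g. EnglishAnalyzer, which combines English stop-word removal with stemming:

import java.io.File;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerDocAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("d:/index")), cfg);
        Document doc = new Document();
        doc.add(new TextField("content", "some raw text to analyze", Field.Store.YES));
        // Analyze just this document with EnglishAnalyzer (stop-word removal + stemming)
        writer.addDocument(doc, new EnglishAnalyzer(Version.LUCENE_40));
        writer.close();
    }
}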

For stemming, take a look at this SO post.

It is hard to suggest anything further, since you have not explained what exactly is failing.
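On the tf/idf part of the question: with a Lucene 3.x index like the one the posted code builds, the raw statistics can be read back through IndexReader. Below is a minimal sketch, assuming the "content" field from the posted code and the plain log(N/df) idf variant (Lucene's own scoring uses a slightly different formula):

import java.io.File;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class TfIdfDump {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/index")));
        int numDocs = reader.numDocs();
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term term = terms.term();
            if (!"content".equals(term.field())) continue; // field name from the posted code
            double idf = Math.log((double) numDocs / terms.docFreq());
            TermDocs termDocs = reader.termDocs(term);
            while (termDocs.next()) {
                int tf = termDocs.freq(); // raw term frequency within one document
                System.out.println(term.text() + "  doc=" + termDocs.doc()
                        + "  tf*idf=" + (tf * idf));
            }
        }
        reader.close();
    }
}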