How Lucene scores phrase queries

Time: 2014-11-21 06:35:46

Tags: lucene

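I use StandardAnalyzer to index my text. At query time, however, I run both term queries and phrase queries. I don't think Lucene has any trouble computing term frequencies and phrase frequencies for them, and that is enough for a model like Dirichlet Similarity. The BM25Similarity and TFIDFSimilarity models, however, also need IDF(term) and IDF(phrase). How does Lucene handle this?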

1 answer:

Answer 0: (score: 1)

With TFIDFSimilarity, the IDF of a phrase is computed as the sum of the IDFs of its constituent terms. That is: idf("ab cd") = idf(ab) + idf(cd)

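That value is then combined with the phrase frequency (through the tf factor) and used for scoring, very much as it would be for a single term.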
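
As a rough illustration of the sum, here is a minimal sketch in plain Java. It does not call Lucene at all: the class and method names are made up for this example, and the idf formula, 1 + ln(maxDocs / (docFreq + 1)), is assumed to be the one DefaultSimilarity used in the Lucene 4.x line. The document frequencies are the ones that appear in the explain output of this answer.

```java
// Hypothetical helper class for illustration only; not part of Lucene.
class PhraseIdfSketch {

    // Assumed DefaultSimilarity-style idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (double) (docFreq + 1));
    }

    // Phrase idf as the sum of the idf of every term in the phrase.
    static double phraseIdf(int maxDocs, int... docFreqs) {
        double sum = 0.0;
        for (int docFreq : docFreqs) {
            sum += idf(docFreq, maxDocs);
        }
        return sum;
    }

    public static void main(String[] args) {
        int maxDocs = 4;
        // "text" occurs in 4 of 4 documents, "ab" in 2 of 4.
        System.out.printf("idf(text)      = %.5f%n", idf(4, maxDocs));
        System.out.printf("idf(ab)        = %.5f%n", idf(2, maxDocs));
        // The explain output lists 2.0645385 for the idf() of "text ab".
        System.out.printf("idf(\"text ab\") = %.5f%n", phraseIdf(maxDocs, 4, 2));
    }
}
```

Under these assumptions the per-term values should agree with the idf(docFreq=..., maxDocs=4) lines in the explain output to about five decimal places.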

To see the whole story, I think it makes the most sense to look at an example. IndexSearcher.explain is a very useful tool for understanding scoring:

Index:

  • doc 0: text ab unique
  • doc 1: text
  • doc 2: text ab cd text ab
  • doc 3: text

Query: "text ab" unique

Explain output for the first (highest-scoring) hit (doc 0):

1.3350155 = (MATCH) sum of:
  0.7981777 = (MATCH) weight(content:"text ab" in 0) [DefaultSimilarity], result of:
    0.7981777 = score(doc=0,freq=1.0 = phraseFreq=1.0
), product of:
      0.7732263 = queryWeight, product of:
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.37452745 = queryNorm
      1.0322692 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        2.0645385 = idf(), sum of:
          0.7768564 = idf(docFreq=4, maxDocs=4)
          1.287682 = idf(docFreq=2, maxDocs=4)
        0.5 = fieldNorm(doc=0)
  0.5368378 = (MATCH) weight(content:unique in 0) [DefaultSimilarity], result of:
    0.5368378 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
      0.6341301 = queryWeight, product of:
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.37452745 = queryNorm
      0.8465736 = fieldWeight in 0, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        1.6931472 = idf(docFreq=1, maxDocs=4)
        0.5 = fieldNorm(doc=0)

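Note that the top half, which scores the "text ab" part of the query, is very similar to the bottom half (which scores unique), except for the summation in the phrase idf calculation.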
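
The arithmetic behind the doc 0 explanation can be retraced with a short sketch. This is plain Java with made-up names, not Lucene code, and it rests on assumptions about DefaultSimilarity in the Lucene 4.x line: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), and queryNorm = 1 / sqrt(sum of squared clause idfs) with every boost left at 1.0. The fieldNorm of 0.5 is simply copied from the explain output rather than derived.

```java
// Hypothetical helper class for illustration only; not part of Lucene.
class ExplainArithmeticSketch {

    // Assumed DefaultSimilarity-style idf.
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (double) (docFreq + 1));
    }

    // Assumed DefaultSimilarity-style tf.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Assumed query normalization with all boosts equal to 1.0.
    static double queryNorm(double... clauseIdfs) {
        double sumOfSquares = 0.0;
        for (double clauseIdf : clauseIdfs) {
            sumOfSquares += clauseIdf * clauseIdf;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // score = queryWeight * fieldWeight, for a term clause and a phrase clause alike.
    static double clauseScore(double idf, double queryNorm, double freq, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        int maxDocs = 4;
        double phraseIdf = idf(4, maxDocs) + idf(2, maxDocs); // "text ab": sum of term idfs
        double uniqueIdf = idf(1, maxDocs);                   // unique: plain term idf
        double norm = queryNorm(phraseIdf, uniqueIdf);
        double fieldNorm = 0.5;                               // copied from the explain output

        double phraseScore = clauseScore(phraseIdf, norm, 1.0, fieldNorm);
        double termScore = clauseScore(uniqueIdf, norm, 1.0, fieldNorm);
        System.out.printf("queryNorm      = %.5f%n", norm);
        System.out.printf("\"text ab\" part = %.5f%n", phraseScore);
        System.out.printf("unique part    = %.5f%n", termScore);
        System.out.printf("total          = %.5f%n", phraseScore + termScore);
    }
}
```

The only place where the phrase and the term differ in this sketch is the idf value passed in; both clauses go through the same clauseScore computation.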

Explain output for the second hit (doc 2), for good measure:

0.49384725 = (MATCH) product of:
  0.9876945 = (MATCH) sum of:
    0.9876945 = (MATCH) weight(content:"text ab" in 2) [DefaultSimilarity], result of:
      0.9876945 = score(doc=2,freq=2.0 = phraseFreq=2.0
), product of:
        0.7732263 = queryWeight, product of:
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.37452745 = queryNorm
        1.277368 = fieldWeight in 2, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = phraseFreq=2.0
          2.0645385 = idf(), sum of:
            0.7768564 = idf(docFreq=4, maxDocs=4)
            1.287682 = idf(docFreq=2, maxDocs=4)
          0.4375 = fieldNorm(doc=2)
  0.5 = coord(1/2)
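
Doc 2 matches only the phrase clause, so its score is additionally scaled by coord(1/2). A minimal sketch of that last step follows, again in plain Java with hypothetical names. It assumes tf = sqrt(freq) and coord = matched clauses / total clauses, as in DefaultSimilarity from the Lucene 4.x line, and it takes queryWeight, idf and fieldNorm directly from the explain output above.

```java
// Hypothetical helper class for illustration only; not part of Lucene.
class CoordSketch {

    // Assumed DefaultSimilarity-style coord: fraction of query clauses that matched.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / (double) maxOverlap;
    }

    // score = queryWeight * (sqrt(freq) * idf * fieldNorm)
    static double clauseScore(double queryWeight, double idf, double freq, double fieldNorm) {
        double fieldWeight = Math.sqrt(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Sum of the matched clause scores, scaled by coord.
    static double booleanScore(int totalClauses, double... matchedClauseScores) {
        double sum = 0.0;
        for (double score : matchedClauseScores) {
            sum += score;
        }
        return coord(matchedClauseScores.length, totalClauses) * sum;
    }

    public static void main(String[] args) {
        // Values copied from the explain output for doc 2.
        double phrase = clauseScore(0.7732263, 2.0645385, 2.0, 0.4375);
        System.out.printf("phrase clause = %.5f%n", phrase);
        // Two clauses in the query ("text ab" and unique), one of them matched.
        System.out.printf("doc 2 score   = %.5f%n", booleanScore(2, phrase));
    }
}
```

Although the phrase occurs twice in doc 2 and its phrase clause scores higher than doc 0's, doc 0 still ranks first because it matches both clauses and is not reduced by coord.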