Solr score and keyword match rate

Date: 2017-10-05 08:24:30

Tags: solr

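I am using Solr 6.1.

I am currently tuning scoring, and I have a problem with the scores.

I search for GCS with qf set to: title^100 content^70 text^50

All three fields are of type text_general.

The first result scores 1050.8486 and the second scores 853.08655,

but the first document has very little text in the content field, while the second has a lot of text there.

I don't understand why the first document scores so much higher.

Here is the debugQuery output for the two results: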

1002.8741 = sum of:
  1002.8741 = max of:
    1002.8741 = weight(title:GCS in 1275) [], result of:
      1002.8741 = score(doc=1275,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.513557 = idf(docFreq=27, docCount=137000)
        1.177973 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          6.3423285 = avgFieldLength
          4.0 = fieldLength
    928.3479 = weight(content:GCS in 1275) [], result of:
      928.3479 = score(doc=1275,freq=2.0 = termFreq=2.0), product of:
        70.0 = boost
        7.1785564 = idf(docFreq=104, docCount=137000)
        1.8474623 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          176.37256 = avgFieldLength
          16.0 = fieldLength

811.1335 = sum of:
  811.1335 = max of:
    127.21202 = weight(text:GCS in 9400) [], result of:
      127.21202 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
        50.0 = boost
        7.464645 = idf(docFreq=78, docCount=137000)
        0.3408388 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          44.69738 = avgFieldLength
          256.0 = fieldLength
    811.1335 = weight(title:GCS in 9400) [], result of:
      811.1335 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.513557 = idf(docFreq=27, docCount=137000)
        0.9527551 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          6.3423285 = avgFieldLength
          7.111111 = fieldLength
    174.06395 = weight(content:GCS in 9400) [], result of:
      174.06395 = score(doc=9400,freq=7.0 = termFreq=7.0), product of:
        70.0 = boost
        7.1785564 = idf(docFreq=104, docCount=137000)
        0.34639663 = tfNorm, computed from:
          7.0 = termFreq=7.0
          1.2 = parameter k1
          0.75 = parameter b
          176.37256 = avgFieldLength
          7281.778 = fieldLength

=============================================== ============================

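I have another question about sharding: omitNorms does not seem to work. Why? The document with long content still scores lower than the one with short content, even though the schema is the same.

The first result is from collection A and has short content; the second is from collection B and has long content: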

1158.9161 = sum of:
  1158.9161 = max of:
    1158.9161 = weight(title:boeing in 52601) [], result of:
      1158.9161 = score(doc=52601,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        11.589161 = idf(docFreq=5, docCount=593568)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    1085.6042 = weight(content:boeing in 52601) [], result of:
      1085.6042 = score(doc=52601,freq=2.0 = termFreq=2.0), product of:
        70.0 = boost
        11.279006 = idf(docFreq=7, docCount=593568)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)

1060.8777 = sum of:
  1060.8777 = max of:
    433.1234 = weight(text:boeing in 39406) [], result of:
      433.1234 = score(doc=39406,freq=1.0 = termFreq=1.0), product of:
        50.0 = boost
        8.662468 = idf(docFreq=112, docCount=650450)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    884.746 = weight(title:boeing in 39406) [], result of:
      884.746 = score(doc=39406,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.84746 = idf(docFreq=93, docCount=650450)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    1060.8777 = weight(content:boeing in 39406) [], result of:
      1060.8777 = score(doc=39406,freq=7.0 = termFreq=7.0), product of:
        70.0 = boost
        8.069756 = idf(docFreq=203, docCount=650450)
        1.8780489 = tfNorm, computed from:
          7.0 = termFreq=7.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
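A note on reading the two outputs above: parameter b is 0.0 in every clause, so field length no longer enters tfNorm, which suggests omitNorms is in fact applied. What differs between the two collections is idf, because docFreq and docCount are different in each. The sketch below is only an illustration; it assumes the usual BM25 idf formula, ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), and the function names are made up for this example.

```python
import math


def bm25_idf(doc_freq, doc_count):
    # Assumed BM25 idf: ln(1 + (N - n + 0.5) / (n + 0.5)).
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm_without_norms(freq, k1=1.2):
    # With b = 0.0 the length term drops out of the BM25 tfNorm formula.
    return freq * (k1 + 1.0) / (freq + k1)


# Values copied from the explain outputs of collection A and collection B.
title_idf_a = bm25_idf(5, 593568)    # title:boeing in collection A
title_idf_b = bm25_idf(93, 650450)   # title:boeing in collection B

print("title idf, collection A:", title_idf_a)
print("title idf, collection B:", title_idf_b)
print("content tfNorm, freq=2:", tf_norm_without_norms(2.0))
print("content tfNorm, freq=7:", tf_norm_without_norms(7.0))
```

The term is much rarer in collection A (docFreq=5) than in collection B (docFreq=93), so the title clause in A gets a larger idf. That alone is enough to put the short document ahead even though its term frequency is lower.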

1 answer:

Answer 0: (score: 1)

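The underlying similarity used by Solr 6.1 is BM25 [1].

This means the length of the field value matters relative to the average field length. More specifically, you are using dismax, so only the maximum field score is taken into account. Let's look at the maximum for each document:

First document, max: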

1002.8741 = weight(title:GCS in 1275) [], result of:
  1002.8741 = score(doc=1275,freq=1.0 = termFreq=1.0), product of:
    100.0 = boost
    8.513557 = idf(docFreq=27, docCount=137000)
    1.177973 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.3423285 = avgFieldLength
      4.0 = fieldLength

Second document, max:

811.1335 = weight(title:GCS in 9400) [], result of:
  811.1335 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
    100.0 = boost
    8.513557 = idf(docFreq=27, docCount=137000)
    0.9527551 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.3423285 = avgFieldLength
      7.111111 = fieldLength
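The two tfNorm values above can be reproduced by hand. This is a minimal sketch that assumes the standard BM25 term-frequency normalization; the function names are invented for the example, and the idf value is copied from the explain output rather than recomputed.

```python
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # Assumed BM25 normalization:
    # freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    length_ratio = field_length / avg_field_length
    return freq * (k1 + 1.0) / (freq + k1 * (1.0 - b + b * length_ratio))


def clause_score(boost, idf, tf_norm):
    # The explain output reports each clause as a product of these three factors.
    return boost * idf * tf_norm


TITLE_IDF = 8.513557       # idf(docFreq=27, docCount=137000), from the output
AVG_TITLE_LEN = 6.3423285  # avgFieldLength for title, from the output

tf_1275 = bm25_tf_norm(1.0, 4.0, AVG_TITLE_LEN)        # title shorter than average
tf_9400 = bm25_tf_norm(1.0, 7.111111, AVG_TITLE_LEN)   # title longer than average

score_1275 = clause_score(100.0, TITLE_IDF, tf_1275)
score_9400 = clause_score(100.0, TITLE_IDF, tf_9400)

print("doc 1275:", tf_1275, score_1275)
print("doc 9400:", tf_9400, score_9400)
```

Both documents have the same boost, idf and term frequency for title, so the whole gap between roughly 1002.87 and 811.13 comes from fieldLength being 4.0 in one and 7.111111 in the other.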

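So the first document, with the shorter title, wins. You can configure dismax / edismax with the tie (tie breaker) parameter so that the other matching fields also contribute to the score, not just the maximum [2].

Regards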
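To see what the tie breaker does to these two documents, here is a rough sketch. It assumes the dismax combination rule of maximum plus tie times the sum of the other field scores; dismax_score is a hypothetical helper, not a Solr API, and the per-field scores are copied from the debug output in the question.

```python
def dismax_score(field_scores, tie=0.0):
    # Assumed rule: best field score + tie * (sum of the remaining field scores).
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Per-field scores from the explain output in the question.
doc_1275 = [1002.8741, 928.3479]              # title, content
doc_9400 = [127.21202, 811.1335, 174.06395]   # text, title, content

for tie in (0.0, 0.1, 1.0):
    print(tie, dismax_score(doc_1275, tie), dismax_score(doc_9400, tie))
```

With tie=0.0 this gives back the pure maximum seen in the debug output. With these particular numbers, document 1275 stays ahead for any tie between 0.0 and 1.0, because its content clause also scores higher; the tie breaker changes how much the other fields count, not necessarily the ranking.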

[1] http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

[2] https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thetie_TieBreaker_Parameter