How to make Elasticsearch scoring take field length into account

Time: 2018-04-23 04:18:20

Tags: elasticsearch normalization relevance

I created a very simple test index containing the following 5 entries:

{    "tags": [        { "topics": "music festival dance techno germany"}    ]}
{    "tags": [        { "topics": "music festival dance techno"}    ]}
{    "tags": [        { "topics": "music festival dance"}    ]}
{    "tags": [        { "topics": "music festival"}    ]}
{    "tags": [        { "topics": "music"}    ]}

Then I ran the following query:

{
  "query": { 
    "bool": { 
      "should": [
        { "match": { "tags.topics": "music festival"}}
      ]
    }
  }
}

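I expected the results to come back in the following order:

1) "music festival"

2) "music festival dance"

3) "music festival dance techno"

4) "music festival dance techno germany"

5) "music"

because of field-length normalization.

But I got the following instead: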

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 5,
        "max_score": 0.5753642,
        "hits": [
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "1",
                "_score": 0.5753642,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival dance techno germany"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "3",
                "_score": 0.5753642,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival dance"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "4",
                "_score": 0.42221835,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "2",
                "_score": 0.32088596,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival dance techno"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "5",
                "_score": 0.2876821,
                "_source": {
                    "tags": [
                        {
                            "topics": "music"
                        }
                    ]
                }
            }
        ]
    }
}

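Apart from the lowest score, which goes to the entry that matches only one word, the order looks completely random.

What could be causing this, and what can I change (in the mapping, at index time, or at search time) to get the expected order?

Note: the same should hold for queries that are not a perfect match. Searching for "music dance" should still return the 3-word entry as the first result, so using or boosting term queries does not seem to be an option.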

1 Answer:

Answer 0: (score: 0)

As I described in this answer, scoring/relevance is not the easiest topic in Elasticsearch.
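To make that a bit more concrete, here is a rough sketch (plain Python, not Elasticsearch code) that recomputes the scores from the question with the standard BM25 formula and its default parameters (k1 = 1.2, b = 0.75). The shard layout in `ASSUMED_SHARDS` is an assumption inferred only from the numbers in the response: documents 1, 3 and 5 each alone on a shard, documents 2 and 4 together. The helper names are made up for illustration.

```python
import math

K1, B = 1.2, 0.75  # BM25 default parameters

DOCS = {
    1: "music festival dance techno germany".split(),
    2: "music festival dance techno".split(),
    3: "music festival dance".split(),
    4: "music festival".split(),
    5: "music".split(),
}
QUERY = ["music", "festival"]


def bm25(query_terms, doc, shard_docs):
    """Score `doc` using term statistics taken from `shard_docs` only."""
    n_docs = len(shard_docs)
    avg_len = sum(len(d) for d in shard_docs) / n_docs
    score = 0.0
    for term in query_terms:
        freq = doc.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for d in shard_docs if term in d)
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        # Length normalization: shorter-than-average fields get a larger factor.
        norm = freq * (K1 + 1) / (freq + K1 * (1 - B + B * len(doc) / avg_len))
        score += idf * norm
    return score


def score_all(shards):
    """Score every document against the statistics of its own shard."""
    scores = {}
    for shard in shards:
        shard_docs = [DOCS[i] for i in shard]
        for i in shard:
            scores[i] = bm25(QUERY, DOCS[i], shard_docs)
    return scores


# ASSUMPTION: this layout is guessed from the reported scores, not known.
ASSUMED_SHARDS = [[1], [3], [5], [2, 4]]

per_shard = score_all(ASSUMED_SHARDS)
single_shard = score_all([[1, 2, 3, 4, 5]])
ranking = sorted(single_shard, key=single_shard.get, reverse=True)

print(per_shard)
print(ranking)
```

Under that assumption the per-shard numbers agree with the `_score` values in the question, which suggests the seemingly random order comes from each shard using its own term statistics rather than from length normalization being ignored. With all five documents on one shard, the same formula ranks them 4, 3, 2, 1, 5, which is the order asked for. If that is what is happening, the usual things to try on a tiny test index are a single primary shard or `search_type=dfs_query_then_fetch`; I have not verified this against the original index.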

I tried to work out a solution for you, and this is what I have so far.

Documents:

{ "tags": [ { "topics": ["music", "festival", "dance", "techno", "germany"]} ], "topics_count": 5 }
{ "tags": [ { "topics": ["music", "festival", "dance", "techno"]} ], "topics_count": 4 }
{ "tags": [ { "topics": ["music", "festival", "dance"] } ], "topics_count": 3 }
{ "tags": [ { "topics": ["music", "festival"]} ], "topics_count": 2 }
{ "tags": [ { "topics": ["music"]} ], "topics_count": 1 }
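The `topics_count` field has to be written at index time. A minimal sketch of how the documents above could be derived from the original space-separated strings; `to_document` is a hypothetical helper, not part of any Elasticsearch client.

```python
def to_document(topics: str) -> dict:
    """Split a space-separated topic string and store the number of topics."""
    terms = topics.split()
    return {"tags": [{"topics": terms}], "topics_count": len(terms)}


SOURCE_STRINGS = [
    "music festival dance techno germany",
    "music festival dance techno",
    "music festival dance",
    "music festival",
    "music",
]

docs = [to_document(s) for s in SOURCE_STRINGS]
for doc in docs:
    print(doc)
```

Storing the count next to the terms keeps the query-time script cheap, at the cost of having to keep the two fields in sync whenever a document is updated.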

And the query:

{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics" : {
                  "terms" : ["music", "festival"],
                  "minimum_should_match_script": {
                    "source": "params.num_terms"
                  }
                }
              }
            },
            "script_score" : {
              "script" : {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        },
        {
          "function_score": {
            "query": {
              "terms_set": {
                "tags.topics" : {
                 "terms" : ["music", "festival"],
                 "minimum_should_match_script": {
                    "source": "doc['topics_count'].value"
                  }
                }
              }
            },
            "script_score" : {
              "script" : {
                "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
              }
            }
          }
        }
      ]
    }
  }
}
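To see which documents each of the two `terms_set` clauses lets through, here is a small model of the matching rule in plain Python. It only mimics the two `minimum_should_match_script` conditions and the `sqrt(1 / topics_count)` factor, not the actual Lucene scoring; the function names are invented for illustration.

```python
import math

DOCS = {
    1: ["music", "festival", "dance", "techno", "germany"],
    2: ["music", "festival", "dance", "techno"],
    3: ["music", "festival", "dance"],
    4: ["music", "festival"],
    5: ["music"],
}


def matched(doc_terms, query_terms):
    """Number of query terms that occur in the document."""
    return len(set(doc_terms) & set(query_terms))


def passes_first_clause(doc_terms, query_terms):
    # Models "params.num_terms": every query term must be present.
    return matched(doc_terms, query_terms) >= len(query_terms)


def passes_second_clause(doc_terms, query_terms):
    # Models "doc['topics_count'].value": every topic of the document
    # must be one of the query terms.
    return matched(doc_terms, query_terms) >= len(doc_terms)


def length_factor(doc_terms):
    # Same expression as the script_score, without the _score part.
    return math.sqrt(1.0 / len(doc_terms))


def explain(query_terms):
    """Map doc id -> (passes clause 1, passes clause 2, length factor)."""
    return {
        doc_id: (
            passes_first_clause(terms, query_terms),
            passes_second_clause(terms, query_terms),
            length_factor(terms),
        )
        for doc_id, terms in DOCS.items()
    }


for query in (["music", "festival"], ["music", "dance"]):
    print(query)
    for doc_id, row in explain(query).items():
        print(" ", doc_id, row)
```

In this model, for ["music", "festival"] the two-topic document passes both clauses and so collects two contributions, while the single-topic document only passes the second one. For ["music", "dance"] the two-topic document passes neither clause, which is one of the places where the query would need further work.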

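This is not perfect and still needs some improvement. For this example it works well with ["music", "festival"] and ["music", "dance"] (tested on ES 6.2), but I suspect that for other inputs it will not work 100% as expected, mainly because of how complex relevance/scoring is. Still, you can now read up on the pieces I used and try to improve it.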