Question

I have an index with the following mapping and analyzer:

settings: {
    analysis: {
      char_filter: {
        custom_cleaner: {
          # remove - and * (we don't want them here)
          type: "mapping",
          mappings: ["-=>", "*=>"]
        }
      },
      analyzer: {
        custom_ngram: {
          tokenizer: "standard",
          filter: [ "lowercase", "custom_ngram_filter" ],
          char_filter: ["custom_cleaner"]
        }
      },
      filter: {
        custom_ngram_filter: {
          type: "nGram",
          min_gram: 3,
          max_gram: 20,
          token_chars: [ "letter", "digit" ]
        }
      }
    }
  },
  mappings: {
    attributes: {
      properties: {
        name: { type: "string"},
        words: { type: "string", similarity: "BM25", analyzer: "custom_ngram" }
      }
    }
  }
}

I have the following two documents in the index:

"name": "shirts", "words": [ "shirt"]

and

"name": "t-shirts", "words": ["t-shirt"]

I run a multi-match query:

"query": {

            "multi_match": {
               "query": "t-shirt",
               "fields": [
                  "words",
                  "name"
               ],
               "analyzer": "custom_ngram"
            }

   }

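The problem:

"shirts" scores 1.17, while "t-shirts" scores 0.8. Why is that, and how can I make "t-shirts" (the direct match) score higher?

I need ngrams for another use case, where I have to detect "contains" matches ("shirt" is in "muscle-shirt", ...). So I guess I can't skip the ngrams.

Thanks!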

Answer 1

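I believe this is happening because you are using the StandardTokenizer, which splits the string "t-shirt" into the tokens "t" and "shirt". However, "t" is shorter than the minimum gram size (min_gram: 3), so no ngrams are generated from it. As a result, you get the same matches in both cases, but the t-shirt document is longer, so it scores lower.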
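To make the gram-size argument concrete, here is a minimal Python sketch of n-gram generation. It is only an illustration, not Lucene's implementation: the helper name `ngrams` is hypothetical, the token list is taken from the split described above, and the sketch models neither the char_filter nor the BM25 scoring. The order of the emitted grams is also not meant to match what Elasticsearch produces.

```python
def ngrams(token, min_gram=3, max_gram=20):
    """Return every substring of `token` with length in [min_gram, max_gram]."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(token):
                break  # longer grams will not fit from this start position
            grams.append(token[start:end])
    return grams


# Tokens as described in the answer (assumption: "t-shirt" was split on the hyphen).
tokens = ["t", "shirt"]
for token in tokens:
    print(repr(token), "->", ngrams(token.lower()))
```

With min_gram set to 3, the one-character token "t" yields an empty list, so only the grams of "shirt" remain to be matched.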

You can use the Explain API to get details on why a document received the score it did.
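One way to get that breakdown is to set "explain": true in the search request body, which asks Elasticsearch to return a score explanation with each hit. The sketch below only builds and prints such a body for the query from the question. The helper name `build_explain_search` is hypothetical, and how the body is sent (client, index name, URL) is left out because it depends on your setup.

```python
import json


def build_explain_search(query_text, fields, analyzer):
    """Build a search body that requests a per-hit score explanation."""
    return {
        "explain": True,  # ask for a scoring breakdown with every hit
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": fields,
                "analyzer": analyzer,
            }
        },
    }


body = build_explain_search("t-shirt", ["words", "name"], "custom_ngram")
print(json.dumps(body, indent=2))
```

Comparing the explanations for the two documents should show which terms matched in each field and how the field length affects the score.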

Are you sure you need ngrams at all? Your example, "shirt" in "muscle-shirt", should be handled fine by the StandardAnalyzer, which tokenizes on the hyphen.

elasticsearch ngrams: Why is a shorter token matched instead of a longer one?