Edge NGram with phrase matching

Date: 2016-08-09 10:20:34

Tags: elasticsearch elasticsearch-mapping elasticsearch-query

I need to autocomplete phrases. For example, when I search for "dementia in alz", I want to get "dementia in Alzheimer's".

For that I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, when I try to match a phrase, I get no results.

What am I doing wrong?

My query:

{
  "query":{
    "multi_match":{
      "query":"dementia in alz",
      "type":"phrase",
      "analyzer":"edge_ngram_analyzer",
      "fields":["_all"]
    }
  }
}

My mapping:

...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...

My documents:

{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfHfBE5CzEm8aJ3Xp", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:48.665Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1400541", 
    "Diagnosis": "F00.0 -  Dementia in Alzheimer's disease with early onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}, 
{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfICrE5CzEm8aJ4Dc", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:51.240Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1424351", 
    "Diagnosis": "F00.1 -  Dementia in Alzheimer's disease with late onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}

Analysis of the "dementia in alzheimer" phrase:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "de", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 4, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 0, 
      "position": 2
    }, 
    {
      "end_offset": 5, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 0, 
      "position": 3
    }, 
    {
      "end_offset": 6, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 0, 
      "position": 4
    }, 
    {
      "end_offset": 7, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 0, 
      "position": 5
    }, 
    {
      "end_offset": 8, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 0, 
      "position": 6
    }, 
    {
      "end_offset": 9, 
      "token": "dementia ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 7
    }, 
    {
      "end_offset": 10, 
      "token": "dementia i", 
      "type": "word", 
      "start_offset": 0, 
      "position": 8
    }, 
    {
      "end_offset": 11, 
      "token": "dementia in", 
      "type": "word", 
      "start_offset": 0, 
      "position": 9
    }, 
    {
      "end_offset": 12, 
      "token": "dementia in ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 10
    }, 
    {
      "end_offset": 13, 
      "token": "dementia in a", 
      "type": "word", 
      "start_offset": 0, 
      "position": 11
    }, 
    {
      "end_offset": 14, 
      "token": "dementia in al", 
      "type": "word", 
      "start_offset": 0, 
      "position": 12
    }, 
    {
      "end_offset": 15, 
      "token": "dementia in alz", 
      "type": "word", 
      "start_offset": 0, 
      "position": 13
    }, 
    {
      "end_offset": 16, 
      "token": "dementia in alzh", 
      "type": "word", 
      "start_offset": 0, 
      "position": 14
    }, 
    {
      "end_offset": 17, 
      "token": "dementia in alzhe", 
      "type": "word", 
      "start_offset": 0, 
      "position": 15
    }, 
    {
      "end_offset": 18, 
      "token": "dementia in alzhei", 
      "type": "word", 
      "start_offset": 0, 
      "position": 16
    }, 
    {
      "end_offset": 19, 
      "token": "dementia in alzheim", 
      "type": "word", 
      "start_offset": 0, 
      "position": 17
    }, 
    {
      "end_offset": 20, 
      "token": "dementia in alzheime", 
      "type": "word", 
      "start_offset": 0, 
      "position": 18
    }, 
    {
      "end_offset": 21, 
      "token": "dementia in alzheimer", 
      "type": "word", 
      "start_offset": 0, 
      "position": 19
    }
  ]
}
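
The token list above was produced with the _analyze API; the request looked roughly like the following sketch (it assumes the carenotes index from the documents above and an Elasticsearch 2.x-style request body):

curl -XPOST 'localhost:9200/carenotes/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "edge_ngram_analyzer",
  "text": "dementia in alzheimer"
}'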

2 answers:

Answer 0 (score: 14):

Many thanks to rendel, who helped me find the correct solution.

The solution by Andrei Stefan is not optimal.

Why? First, the absence of a lowercase filter in the search analyzer makes searching inconvenient: the case of the query has to match exactly. A custom analyzer with the lowercase filter is needed instead of "analyzer": "keyword".
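
As a quick illustration (a sketch, not part of the original setup): the built-in keyword analyzer returns the query text as a single, unlowercased token, so a differently cased query can never match the lowercased ngram tokens in the index. You can check this with the _analyze API:

curl -XPOST 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "keyword",
  "text": "Dementia In Alz"
}'

The response contains a single token, "Dementia In Alz", exactly as typed.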

Second, the analysis part is wrong! At index time the string "F00.0 -  Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer we get the following array of tokens for the analyzed string:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 6, 
      "token": "0 ", 
      "type": "word", 
      "start_offset": 4, 
      "position": 2
    }, 
    {
      "end_offset": 9, 
      "token": "  ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 3
    }, 
    {
      "end_offset": 10, 
      "token": "  d", 
      "type": "word", 
      "start_offset": 7, 
      "position": 4
    }, 
    {
      "end_offset": 11, 
      "token": "  de", 
      "type": "word", 
      "start_offset": 7, 
      "position": 5
    }, 
    {
      "end_offset": 12, 
      "token": "  dem", 
      "type": "word", 
      "start_offset": 7, 
      "position": 6
    }, 
    {
      "end_offset": 13, 
      "token": "  deme", 
      "type": "word", 
      "start_offset": 7, 
      "position": 7
    }, 
    {
      "end_offset": 14, 
      "token": "  demen", 
      "type": "word", 
      "start_offset": 7, 
      "position": 8
    }, 
    {
      "end_offset": 15, 
      "token": "  dement", 
      "type": "word", 
      "start_offset": 7, 
      "position": 9
    }, 
    {
      "end_offset": 16, 
      "token": "  dementi", 
      "type": "word", 
      "start_offset": 7, 
      "position": 10
    }, 
    {
      "end_offset": 17, 
      "token": "  dementia", 
      "type": "word", 
      "start_offset": 7, 
      "position": 11
    }, 
    {
      "end_offset": 18, 
      "token": "  dementia ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 12
    }, 
    {
      "end_offset": 19, 
      "token": "  dementia i", 
      "type": "word", 
      "start_offset": 7, 
      "position": 13
    }, 
    {
      "end_offset": 20, 
      "token": "  dementia in", 
      "type": "word", 
      "start_offset": 7, 
      "position": 14
    }, 
    {
      "end_offset": 21, 
      "token": "  dementia in ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 15
    }, 
    {
      "end_offset": 22, 
      "token": "  dementia in a", 
      "type": "word", 
      "start_offset": 7, 
      "position": 16
    }, 
    {
      "end_offset": 23, 
      "token": "  dementia in al", 
      "type": "word", 
      "start_offset": 7, 
      "position": 17
    }, 
    {
      "end_offset": 24, 
      "token": "  dementia in alz", 
      "type": "word", 
      "start_offset": 7, 
      "position": 18
    }, 
    {
      "end_offset": 25, 
      "token": "  dementia in alzh", 
      "type": "word", 
      "start_offset": 7, 
      "position": 19
    }, 
    {
      "end_offset": 26, 
      "token": "  dementia in alzhe", 
      "type": "word", 
      "start_offset": 7, 
      "position": 20
    }, 
    {
      "end_offset": 27, 
      "token": "  dementia in alzhei", 
      "type": "word", 
      "start_offset": 7, 
      "position": 21
    }, 
    {
      "end_offset": 28, 
      "token": "  dementia in alzheim", 
      "type": "word", 
      "start_offset": 7, 
      "position": 22
    }, 
    {
      "end_offset": 29, 
      "token": "  dementia in alzheime", 
      "type": "word", 
      "start_offset": 7, 
      "position": 23
    }, 
    {
      "end_offset": 30, 
      "token": "  dementia in alzheimer", 
      "type": "word", 
      "start_offset": 7, 
      "position": 24
    }, 
    {
      "end_offset": 33, 
      "token": "s ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 25
    }, 
    {
      "end_offset": 34, 
      "token": "s d", 
      "type": "word", 
      "start_offset": 31, 
      "position": 26
    }, 
    {
      "end_offset": 35, 
      "token": "s di", 
      "type": "word", 
      "start_offset": 31, 
      "position": 27
    }, 
    {
      "end_offset": 36, 
      "token": "s dis", 
      "type": "word", 
      "start_offset": 31, 
      "position": 28
    }, 
    {
      "end_offset": 37, 
      "token": "s dise", 
      "type": "word", 
      "start_offset": 31, 
      "position": 29
    }, 
    {
      "end_offset": 38, 
      "token": "s disea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 30
    }, 
    {
      "end_offset": 39, 
      "token": "s diseas", 
      "type": "word", 
      "start_offset": 31, 
      "position": 31
    }, 
    {
      "end_offset": 40, 
      "token": "s disease", 
      "type": "word", 
      "start_offset": 31, 
      "position": 32
    }, 
    {
      "end_offset": 41, 
      "token": "s disease ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 33
    }, 
    {
      "end_offset": 42, 
      "token": "s disease w", 
      "type": "word", 
      "start_offset": 31, 
      "position": 34
    }, 
    {
      "end_offset": 43, 
      "token": "s disease wi", 
      "type": "word", 
      "start_offset": 31, 
      "position": 35
    }, 
    {
      "end_offset": 44, 
      "token": "s disease wit", 
      "type": "word", 
      "start_offset": 31, 
      "position": 36
    }, 
    {
      "end_offset": 45, 
      "token": "s disease with", 
      "type": "word", 
      "start_offset": 31, 
      "position": 37
    }, 
    {
      "end_offset": 46, 
      "token": "s disease with ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 38
    }, 
    {
      "end_offset": 47, 
      "token": "s disease with e", 
      "type": "word", 
      "start_offset": 31, 
      "position": 39
    }, 
    {
      "end_offset": 48, 
      "token": "s disease with ea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 40
    }, 
    {
      "end_offset": 49, 
      "token": "s disease with ear", 
      "type": "word", 
      "start_offset": 31, 
      "position": 41
    }, 
    {
      "end_offset": 50, 
      "token": "s disease with earl", 
      "type": "word", 
      "start_offset": 31, 
      "position": 42
    }, 
    {
      "end_offset": 51, 
      "token": "s disease with early", 
      "type": "word", 
      "start_offset": 31, 
      "position": 43
    }, 
    {
      "end_offset": 52, 
      "token": "s disease with early ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 44
    }, 
    {
      "end_offset": 53, 
      "token": "s disease with early o", 
      "type": "word", 
      "start_offset": 31, 
      "position": 45
    }, 
    {
      "end_offset": 54, 
      "token": "s disease with early on", 
      "type": "word", 
      "start_offset": 31, 
      "position": 46
    }, 
    {
      "end_offset": 55, 
      "token": "s disease with early ons", 
      "type": "word", 
      "start_offset": 31, 
      "position": 47
    }, 
    {
      "end_offset": 56, 
      "token": "s disease with early onse", 
      "type": "word", 
      "start_offset": 31, 
      "position": 48
    }
  ]
}

As you can see, the whole string is tokenized with token sizes from 2 to 25 characters. The string is tokenized in a linear way, together with all the whitespace, and the position is incremented by one for every new token.

There are several problems with it:

  1. The edge_ngram_analyzer produces useless tokens that will never be searched for, for example: "0 ", "  ", "  d", "s d", "s disease w" and so on.
  2. Moreover, it does not produce many useful tokens that could actually be searched for, for example: "disease", "early onset" and so on. If you try to search for any of these words, there will be 0 results.
  3. Note that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25" we "lost" some text in all fields. You can no longer search for that text, because there are no tokens for it.
  4. The trim filter only masks the problem of the extra whitespace, which could have been avoided by the tokenizer in the first place.
  5. The edge_ngram_analyzer increments the position of every token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead, which preserves the position of the token when generating the ngrams.

The optimal solution:

    Mapping settings to use:

    ...
    "mappings": {
        "Type": {
           "_all":{
              "analyzer": "edge_ngram_analyzer", 
              "search_analyzer": "keyword_analyzer"
            }, 
            "properties": {
              "Field": {
                "search_analyzer": "keyword_analyzer",
                 "type": "string",
                 "analyzer": "edge_ngram_analyzer"
              },
    ...
    ...
    "settings": {
       "analysis": {
          "filter": {
             "english_poss_stemmer": {
                "type": "stemmer",
                "name": "possessive_english"
             },
             "edge_ngram": {
               "type": "edgeNGram",
               "min_gram": "2",
               "max_gram": "25",
               "token_chars": ["letter", "digit"]
             }
          },
          "analyzer": {
             "edge_ngram_analyzer": {
               "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
               "tokenizer": "standard"
             },
             "keyword_analyzer": {
               "filter": ["lowercase", "english_poss_stemmer"],
               "tokenizer": "standard"
             }
          }
       }
    }
    ...
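
    For reference, here is a condensed, self-contained sketch of creating such an index in a single request. It only illustrates the idea: the index name carenotes and the Diagnosis type/field are taken from the documents above, the analysis section mirrors the fragments above (minus the token_chars option, which this sketch does not need), and the request format targets the 2.x-era API used in this question:

    curl -XPUT 'localhost:9200/carenotes' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" },
            "edge_ngram": { "type": "edgeNGram", "min_gram": 2, "max_gram": 25 }
          },
          "analyzer": {
            "edge_ngram_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"]
            },
            "keyword_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "english_poss_stemmer"]
            }
          }
        }
      },
      "mappings": {
        "Diagnosis": {
          "_all": {
            "analyzer": "edge_ngram_analyzer",
            "search_analyzer": "keyword_analyzer"
          },
          "properties": {
            "Diagnosis": {
              "type": "string",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "keyword_analyzer"
            }
          }
        }
      }
    }'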
    

    Look at the analysis:

    {
      "tokens": [
        {
          "end_offset": 5, 
          "token": "f0", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 5, 
          "token": "f00", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 5, 
          "token": "f00.", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 5, 
          "token": "f00.0", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 17, 
          "token": "de", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dem", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "deme", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "demen", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dement", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dementi", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dementia", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 20, 
          "token": "in", 
          "type": "word", 
          "start_offset": 18, 
          "position": 3
        }, 
        {
          "end_offset": 32, 
          "token": "al", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alz", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzh", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzhe", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzhei", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzheim", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzheime", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzheimer", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 40, 
          "token": "di", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "dis", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "dise", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "disea", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "diseas", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "disease", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 45, 
          "token": "wi", 
          "type": "word", 
          "start_offset": 41, 
          "position": 6
        }, 
        {
          "end_offset": 45, 
          "token": "wit", 
          "type": "word", 
          "start_offset": 41, 
          "position": 6
        }, 
        {
          "end_offset": 45, 
          "token": "with", 
          "type": "word", 
          "start_offset": 41, 
          "position": 6
        }, 
        {
          "end_offset": 51, 
          "token": "ea", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 51, 
          "token": "ear", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 51, 
          "token": "earl", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 51, 
          "token": "early", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 57, 
          "token": "on", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }, 
        {
          "end_offset": 57, 
          "token": "ons", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }, 
        {
          "end_offset": 57, 
          "token": "onse", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }, 
        {
          "end_offset": 57, 
          "token": "onset", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }
      ]
    }
    

    At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for whole words. At search time the text is tokenized by the same standard tokenizer, but the words are then filtered only by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.
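
    You can check the search-time side with the _analyze API as well (a sketch, assuming the carenotes index and the keyword_analyzer defined above):

    curl -XPOST 'localhost:9200/carenotes/_analyze' -H 'Content-Type: application/json' -d '
    {
      "analyzer": "keyword_analyzer",
      "text": "dem in alzh"
    }'

    It returns the tokens "dem" (position 0), "in" (position 1) and "alzh" (position 2). Their relative positions line up with the index-time tokens "dem" (position 2), "in" (position 3) and "alzh" (position 4) shown above, which is exactly what a phrase query needs.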

    Thus we make the incremental search possible!

    Now, because we do the ngrams on separate words, we can even execute queries like this:

    {
      'query': {
        'multi_match': {
          'query': 'dem in alzh',  
          'type': 'phrase', 
          'fields': ['_all']
        }
      }
    }
    

    and get correct results.

    No text is "lost", everything is searchable, and there is no need to handle the whitespace with a trim filter anymore.

Answer 1 (score: 8):

I think your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible. Try this query instead:

{
  "query": {
    "multi_match": {
      "query": "  dementia in alz",
      "analyzer": "keyword",
      "fields": [
        "_all"
      ]
    }
  }
}

You will notice the two whitespaces before dementia. They are there because your analyzer keeps the whitespace in the tokens it builds from the text. To get rid of them you need the trim token filter:

   "edge_ngram_analyzer": {
      "filter": [
        "lowercase","trim"
      ],
      "tokenizer": "edge_ngram_tokenizer"
    }
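
After updating the analyzer definition (which means recreating the index, or closing it, updating the settings and reopening it), you can re-check the tokens with the _analyze API; a sketch, assuming the carenotes index from the documents above:

curl -XPOST 'localhost:9200/carenotes/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "edge_ngram_analyzer",
  "text": "F00.0 -  Dementia in Alz"
}'

With trim in the filter chain, the leading and trailing spaces are stripped from each ngram token, so the query no longer has to include them.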

And then this query will work (no whitespaces before dementia):

{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "analyzer": "keyword",
      "fields": [
        "_all"
      ]
    }
  }
}