Edge NGram with phrase matching

Date: 2016-08-09 10:20:34

Tags: elasticsearch elasticsearch-mapping elasticsearch-query

I need to autocomplete phrases. For example, when I search for "dementia in alz", I want to get "dementia in Alzheimer's".

For that I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, when I try to match a phrase, I get no results.

What am I doing wrong?

My query:

{
  "query":{
    "multi_match":{
      "query":"dementia in alz",
      "type":"phrase",
      "analyzer":"edge_ngram_analyzer",
      "fields":["_all"]
    }
  }
}

My mapping:

...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...

My documents:

{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfHfBE5CzEm8aJ3Xp", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:48.665Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1400541", 
    "Diagnosis": "F00.0 -  Dementia in Alzheimer's disease with early onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}, 
{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfICrE5CzEm8aJ4Dc", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:51.240Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1424351", 
    "Diagnosis": "F00.1 -  Dementia in Alzheimer's disease with late onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}

Analysis of the "dementia in alzheimer" phrase:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "de", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 4, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 0, 
      "position": 2
    }, 
    {
      "end_offset": 5, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 0, 
      "position": 3
    }, 
    {
      "end_offset": 6, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 0, 
      "position": 4
    }, 
    {
      "end_offset": 7, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 0, 
      "position": 5
    }, 
    {
      "end_offset": 8, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 0, 
      "position": 6
    }, 
    {
      "end_offset": 9, 
      "token": "dementia ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 7
    }, 
    {
      "end_offset": 10, 
      "token": "dementia i", 
      "type": "word", 
      "start_offset": 0, 
      "position": 8
    }, 
    {
      "end_offset": 11, 
      "token": "dementia in", 
      "type": "word", 
      "start_offset": 0, 
      "position": 9
    }, 
    {
      "end_offset": 12, 
      "token": "dementia in ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 10
    }, 
    {
      "end_offset": 13, 
      "token": "dementia in a", 
      "type": "word", 
      "start_offset": 0, 
      "position": 11
    }, 
    {
      "end_offset": 14, 
      "token": "dementia in al", 
      "type": "word", 
      "start_offset": 0, 
      "position": 12
    }, 
    {
      "end_offset": 15, 
      "token": "dementia in alz", 
      "type": "word", 
      "start_offset": 0, 
      "position": 13
    }, 
    {
      "end_offset": 16, 
      "token": "dementia in alzh", 
      "type": "word", 
      "start_offset": 0, 
      "position": 14
    }, 
    {
      "end_offset": 17, 
      "token": "dementia in alzhe", 
      "type": "word", 
      "start_offset": 0, 
      "position": 15
    }, 
    {
      "end_offset": 18, 
      "token": "dementia in alzhei", 
      "type": "word", 
      "start_offset": 0, 
      "position": 16
    }, 
    {
      "end_offset": 19, 
      "token": "dementia in alzheim", 
      "type": "word", 
      "start_offset": 0, 
      "position": 17
    }, 
    {
      "end_offset": 20, 
      "token": "dementia in alzheime", 
      "type": "word", 
      "start_offset": 0, 
      "position": 18
    }, 
    {
      "end_offset": 21, 
      "token": "dementia in alzheimer", 
      "type": "word", 
      "start_offset": 0, 
      "position": 19
    }
  ]
}
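
The token list above was produced with the _analyze API; the request looked roughly like the following sketch (it assumes the carenotes index from the documents above and an Elasticsearch 2.x-style request body):

curl -XPOST 'localhost:9200/carenotes/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "edge_ngram_analyzer",
  "text": "dementia in alzheimer"
}'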

2 answers:

Answer 0 (score: 14):

Many thanks to rendel, who helped me find the correct solution.

The solution by Andrei Stefan is not optimal.

Why? First, the absence of a lowercase filter in the search analyzer makes searching inconvenient: the case of the query has to match exactly. A custom analyzer with the lowercase filter is needed instead of "analyzer": "keyword".
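
As a quick illustration (a sketch, not part of the original setup): the built-in keyword analyzer returns the query text as a single, unlowercased token, so a differently cased query can never match the lowercased ngram tokens in the index. You can check this with the _analyze API:

curl -XPOST 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "keyword",
  "text": "Dementia In Alz"
}'

The response contains a single token, "Dementia In Alz", exactly as typed.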

Second, the analysis part is wrong! At index time the string "F00.0 -  Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer we get the following array of tokens for the analyzed string:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 6, 
      "token": "0 ", 
      "type": "word", 
      "start_offset": 4, 
      "position": 2
    }, 
    {
      "end_offset": 9, 
      "token": "  ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 3
    }, 
    {
      "end_offset": 10, 
      "token": "  d", 
      "type": "word", 
      "start_offset": 7, 
      "position": 4
    }, 
    {
      "end_offset": 11, 
      "token": "  de", 
      "type": "word", 
      "start_offset": 7, 
      "position": 5
    }, 
    {
      "end_offset": 12, 
      "token": "  dem", 
      "type": "word", 
      "start_offset": 7, 
      "position": 6
    }, 
    {
      "end_offset": 13, 
      "token": "  deme", 
      "type": "word", 
      "start_offset": 7, 
      "position": 7
    }, 
    {
      "end_offset": 14, 
      "token": "  demen", 
      "type": "word", 
      "start_offset": 7, 
      "position": 8
    }, 
    {
      "end_offset": 15, 
      "token": "  dement", 
      "type": "word", 
      "start_offset": 7, 
      "position": 9
    }, 
    {
      "end_offset": 16, 
      "token": "  dementi", 
      "type": "word", 
      "start_offset": 7, 
      "position": 10
    }, 
    {
      "end_offset": 17, 
      "token": "  dementia", 
      "type": "word", 
      "start_offset": 7, 
      "position": 11
    }, 
    {
      "end_offset": 18, 
      "token": "  dementia ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 12
    }, 
    {
      "end_offset": 19, 
      "token": "  dementia i", 
      "type": "word", 
      "start_offset": 7, 
      "position": 13
    }, 
    {
      "end_offset": 20, 
      "token": "  dementia in", 
      "type": "word", 
      "start_offset": 7, 
      "position": 14
    }, 
    {
      "end_offset": 21, 
      "token": "  dementia in ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 15
    }, 
    {
      "end_offset": 22, 
      "token": "  dementia in a", 
      "type": "word", 
      "start_offset": 7, 
      "position": 16
    }, 
    {
      "end_offset": 23, 
      "token": "  dementia in al", 
      "type": "word", 
      "start_offset": 7, 
      "position": 17
    }, 
    {
      "end_offset": 24, 
      "token": "  dementia in alz", 
      "type": "word", 
      "start_offset": 7, 
      "position": 18
    }, 
    {
      "end_offset": 25, 
      "token": "  dementia in alzh", 
      "type": "word", 
      "start_offset": 7, 
      "position": 19
    }, 
    {
      "end_offset": 26, 
      "token": "  dementia in alzhe", 
      "type": "word", 
      "start_offset": 7, 
      "position": 20
    }, 
    {
      "end_offset": 27, 
      "token": "  dementia in alzhei", 
      "type": "word", 
      "start_offset": 7, 
      "position": 21
    }, 
    {
      "end_offset": 28, 
      "token": "  dementia in alzheim", 
      "type": "word", 
      "start_offset": 7, 
      "position": 22
    }, 
    {
      "end_offset": 29, 
      "token": "  dementia in alzheime", 
      "type": "word", 
      "start_offset": 7, 
      "position": 23
    }, 
    {
      "end_offset": 30, 
      "token": "  dementia in alzheimer", 
      "type": "word", 
      "start_offset": 7, 
      "position": 24
    }, 
    {
      "end_offset": 33, 
      "token": "s ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 25
    }, 
    {
      "end_offset": 34, 
      "token": "s d", 
      "type": "word", 
      "start_offset": 31, 
      "position": 26
    }, 
    {
      "end_offset": 35, 
      "token": "s di", 
      "type": "word", 
      "start_offset": 31, 
      "position": 27
    }, 
    {
      "end_offset": 36, 
      "token": "s dis", 
      "type": "word", 
      "start_offset": 31, 
      "position": 28
    }, 
    {
      "end_offset": 37, 
      "token": "s dise", 
      "type": "word", 
      "start_offset": 31, 
      "position": 29
    }, 
    {
      "end_offset": 38, 
      "token": "s disea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 30
    }, 
    {
      "end_offset": 39, 
      "token": "s diseas", 
      "type": "word", 
      "start_offset": 31, 
      "position": 31
    }, 
    {
      "end_offset": 40, 
      "token": "s disease", 
      "type": "word", 
      "start_offset": 31, 
      "position": 32
    }, 
    {
      "end_offset": 41, 
      "token": "s disease ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 33
    }, 
    {
      "end_offset": 42, 
      "token": "s disease w", 
      "type": "word", 
      "start_offset": 31, 
      "position": 34
    }, 
    {
      "end_offset": 43, 
      "token": "s disease wi", 
      "type": "word", 
      "start_offset": 31, 
      "position": 35
    }, 
    {
      "end_offset": 44, 
      "token": "s disease wit", 
      "type": "word", 
      "start_offset": 31, 
      "position": 36
    }, 
    {
      "end_offset": 45, 
      "token": "s disease with", 
      "type": "word", 
      "start_offset": 31, 
      "position": 37
    }, 
    {
      "end_offset": 46, 
      "token": "s disease with ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 38
    }, 
    {
      "end_offset": 47, 
      "token": "s disease with e", 
      "type": "word", 
      "start_offset": 31, 
      "position": 39
    }, 
    {
      "end_offset": 48, 
      "token": "s disease with ea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 40
    }, 
    {
      "end_offset": 49, 
      "token": "s disease with ear", 
      "type": "word", 
      "start_offset": 31, 
      "position": 41
    }, 
    {
      "end_offset": 50, 
      "token": "s disease with earl", 
      "type": "word", 
      "start_offset": 31, 
      "position": 42
    }, 
    {
      "end_offset": 51, 
      "token": "s disease with early", 
      "type": "word", 
      "start_offset": 31, 
      "position": 43
    }, 
    {
      "end_offset": 52, 
      "token": "s disease with early ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 44
    }, 
    {
      "end_offset": 53, 
      "token": "s disease with early o", 
      "type": "word", 
      "start_offset": 31, 
      "position": 45
    }, 
    {
      "end_offset": 54, 
      "token": "s disease with early on", 
      "type": "word", 
      "start_offset": 31, 
      "position": 46
    }, 
    {
      "end_offset": 55, 
      "token": "s disease with early ons", 
      "type": "word", 
      "start_offset": 31, 
      "position": 47
    }, 
    {
      "end_offset": 56, 
      "token": "s disease with early onse", 
      "type": "word", 
      "start_offset": 31, 
      "position": 48
    }
  ]
}

As you can see, the whole string is tokenized with token sizes from 2 to 25 characters. The string is tokenized in a linear way, together with all the whitespace, and the position is incremented by one for every new token.

There are several problems with it:

  1. The edge_ngram_analyzer produces useless tokens that will never be searched for, for example: "0 ", "  ", "  d", "s d", "s disease w" and so on.
  2. Moreover, it does not produce many useful tokens that could actually be searched for, for example: "disease", "early onset" and so on. If you try to search for any of these words, there will be 0 results.
  3. Note that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25" we "lost" some text in all fields. You can no longer search for that text, because there are no tokens for it.
  4. The trim filter only masks the problem of the extra whitespace, which could have been avoided by the tokenizer in the first place.
  5. The edge_ngram_analyzer increments the position of every token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead, which preserves the position of the token when generating the ngrams.

The optimal solution:

    Mapping settings to use:

    ...
    "mappings": {
        "Type": {
           "_all":{
              "analyzer": "edge_ngram_analyzer", 
              "search_analyzer": "keyword_analyzer"
            }, 
            "properties": {
              "Field": {
                "search_analyzer": "keyword_analyzer",
                 "type": "string",
                 "analyzer": "edge_ngram_analyzer"
              },
    ...
    ...
    "settings": {
       "analysis": {
          "filter": {
             "english_poss_stemmer": {
                "type": "stemmer",
                "name": "possessive_english"
             },
             "edge_ngram": {
               "type": "edgeNGram",
               "min_gram": "2",
               "max_gram": "25",
               "token_chars": ["letter", "digit"]
             }
          },
          "analyzer": {
             "edge_ngram_analyzer": {
               "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
               "tokenizer": "standard"
             },
             "keyword_analyzer": {
               "filter": ["lowercase", "english_poss_stemmer"],
               "tokenizer": "standard"
             }
          }
       }
    }
    ...
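
    For reference, here is a condensed, self-contained sketch of creating such an index in a single request. It only illustrates the idea: the index name carenotes and the Diagnosis type/field are taken from the documents above, the analysis section mirrors the fragments above (minus the token_chars option, which this sketch does not need), and the request format targets the 2.x-era API used in this question:

    curl -XPUT 'localhost:9200/carenotes' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" },
            "edge_ngram": { "type": "edgeNGram", "min_gram": 2, "max_gram": 25 }
          },
          "analyzer": {
            "edge_ngram_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"]
            },
            "keyword_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "english_poss_stemmer"]
            }
          }
        }
      },
      "mappings": {
        "Diagnosis": {
          "_all": {
            "analyzer": "edge_ngram_analyzer",
            "search_analyzer": "keyword_analyzer"
          },
          "properties": {
            "Diagnosis": {
              "type": "string",
              "analyzer": "edge_ngram_analyzer",
              "search_analyzer": "keyword_analyzer"
            }
          }
        }
      }
    }'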
    

    Look at the analysis:

    {
      "tokens": [
        {
          "end_offset": 5, 
          "token": "f0", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 5, 
          "token": "f00", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 5, 
          "token": "f00.", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 5, 
          "token": "f00.0", 
          "type": "word", 
          "start_offset": 0, 
          "position": 0
        }, 
        {
          "end_offset": 17, 
          "token": "de", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dem", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "deme", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "demen", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dement", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dementi", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 17, 
          "token": "dementia", 
          "type": "word", 
          "start_offset": 9, 
          "position": 2
        }, 
        {
          "end_offset": 20, 
          "token": "in", 
          "type": "word", 
          "start_offset": 18, 
          "position": 3
        }, 
        {
          "end_offset": 32, 
          "token": "al", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alz", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzh", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzhe", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzhei", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzheim", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzheime", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 32, 
          "token": "alzheimer", 
          "type": "word", 
          "start_offset": 21, 
          "position": 4
        }, 
        {
          "end_offset": 40, 
          "token": "di", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "dis", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "dise", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "disea", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "diseas", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 40, 
          "token": "disease", 
          "type": "word", 
          "start_offset": 33, 
          "position": 5
        }, 
        {
          "end_offset": 45, 
          "token": "wi", 
          "type": "word", 
          "start_offset": 41, 
          "position": 6
        }, 
        {
          "end_offset": 45, 
          "token": "wit", 
          "type": "word", 
          "start_offset": 41, 
          "position": 6
        }, 
        {
          "end_offset": 45, 
          "token": "with", 
          "type": "word", 
          "start_offset": 41, 
          "position": 6
        }, 
        {
          "end_offset": 51, 
          "token": "ea", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 51, 
          "token": "ear", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 51, 
          "token": "earl", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 51, 
          "token": "early", 
          "type": "word", 
          "start_offset": 46, 
          "position": 7
        }, 
        {
          "end_offset": 57, 
          "token": "on", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }, 
        {
          "end_offset": 57, 
          "token": "ons", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }, 
        {
          "end_offset": 57, 
          "token": "onse", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }, 
        {
          "end_offset": 57, 
          "token": "onset", 
          "type": "word", 
          "start_offset": 52, 
          "position": 8
        }
      ]
    }
    

    At index time the text is tokenized by the standard tokenizer, then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for whole words. At search time the text is tokenized by the same standard tokenizer, but the words are then filtered only by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.
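
    You can check the search-time side with the _analyze API as well (a sketch, assuming the carenotes index and the keyword_analyzer defined above):

    curl -XPOST 'localhost:9200/carenotes/_analyze' -H 'Content-Type: application/json' -d '
    {
      "analyzer": "keyword_analyzer",
      "text": "dem in alzh"
    }'

    It returns the tokens "dem" (position 0), "in" (position 1) and "alzh" (position 2). Their relative positions line up with the index-time tokens "dem" (position 2), "in" (position 3) and "alzh" (position 4) shown above, which is exactly what a phrase query needs.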

    Thus we make the incremental search possible!

    Now, because we do the ngrams on separate words, we can even execute queries like this:

    {
      'query': {
        'multi_match': {
          'query': 'dem in alzh',  
          'type': 'phrase', 
          'fields': ['_all']
        }
      }
    }
    

    and get correct results.

    No text is "lost", everything is searchable, and there is no need to handle the whitespace with a trim filter anymore.

Answer 1 (score: 8):

I think your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible. Try this query instead:

{
  "query": {
    "multi_match": {
      "query": "  dementia in alz",
      "analyzer": "keyword",
      "fields": [
        "_all"
      ]
    }
  }
}

You will notice the two whitespaces before dementia. They are there because your analyzer keeps the whitespace in the tokens it builds from the text. To get rid of them you need the trim token filter:

   "edge_ngram_analyzer": {
      "filter": [
        "lowercase","trim"
      ],
      "tokenizer": "edge_ngram_tokenizer"
    }
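
After updating the analyzer definition (which means recreating the index, or closing it, updating the settings and reopening it), you can re-check the tokens with the _analyze API; a sketch, assuming the carenotes index from the documents above:

curl -XPOST 'localhost:9200/carenotes/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "edge_ngram_analyzer",
  "text": "F00.0 -  Dementia in Alz"
}'

With trim in the filter chain, the leading and trailing spaces are stripped from each ngram token, so the query no longer has to include them.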

And then this query will work (no whitespaces before dementia):

{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "analyzer": "keyword",
      "fields": [
        "_all"
      ]
    }
  }
}