Elasticsearch ngrams: unexpected behavior for autocomplete

Time: 2019-05-08 21:29:35

Tags: elasticsearch

Here is a simplified version of what I have:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title": "Quick Foxes" 
}

PUT my_index/_doc/2
{
  "title": "Quick Fuxes" 
}

PUT my_index/_doc/3
{
  "title": "Foxes Quick" 
}

PUT my_index/_doc/4
{
  "title": "Foxes Slow" 
}

I am trying to search for Quick Fo to test the autocomplete:

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo",
        "operator": "and"
      }
    }
  }
}

The problem is that this query also returns Foxes Quick, where I expected only Quick Foxes:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "title": "Quick Foxes"
        }
      },
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "Foxes Quick"   <<<----- WHY???
        }
      }
    ]
  }
}

What can I adjust so that the query behaves like a classic autocomplete, where "Quick Fo" never returns "Foxes Quick" and only returns "Quick Foxes"?

---- Additional info -----------------------

This works for me:

PUT my_index1
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete", 
          "search_analyzer": "standard" 
        }
      }
    }
  }
}


PUT my_index1/_doc/1
{
  "text": "Quick Brown Fox" 
}

PUT my_index1/_doc/2
{
  "text": "Quick Frown Fox" 
}


PUT my_index1/_doc/3
{
  "text": "Quick Fragile Fox" 
}


GET my_index1/_search
{
  "query": {
    "match": {
      "text": {
        "query": "quick br", 
        "operator": "and"
      }
    }
  }
}

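1 answer:

Answer 0: (score: 3)

The problem is caused by your search analyzer, autocomplete_search, which uses the lowercase tokenizer. Your search text Quick Fo is therefore split into two terms, quick and fo (note the lowercasing), and those terms are matched against the tokens that the autocomplete analyzer generated for the indexed documents.

The title Foxes Quick is indexed with the autocomplete analyzer, so its tokens include both quick and fo, and the document matches both search terms. A match query with operator and only requires every term to be present; it does not check their order.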

You can use the _analyze API to inspect the tokens generated for your documents and for your search text, which makes this easier to understand.
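
As a minimal sketch of that check, assuming the my_index index from the question still exists, the two requests below run the index-time analyzer on the unexpected title and the search-time analyzer on the query text:

GET my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Foxes Quick"
}

GET my_index/_analyze
{
  "analyzer": "autocomplete_search",
  "text": "Quick Fo"
}

The first request should list fo, fox, foxe, foxes, qu, qui, quic and quick; the second should list quick and fo. Both search terms appear among the tokens indexed for Foxes Quick, which is why document 3 is returned.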

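For how to implement autocomplete, see the official ES documentation: https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html. It also uses a different search-time analyzer, but that approach has limitations and does not cover every use case (especially with documents like yours), so I implemented it with a different design based on my business requirements.

I hope this makes clear why the second document is returned in your case.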

EDIT: In your case, IMO a match phrase prefix query would also be more useful.
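
A minimal sketch of that idea follows. It assumes a separate index whose title field uses the default standard analyzer instead of edge n-grams; the index name my_index2 is hypothetical and does not come from the question.

# my_index2 is a hypothetical index name used only for this sketch
PUT my_index2
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}

GET my_index2/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "Quick Fo"
    }
  }
}

With the four documents from the question indexed into it, this should return only Quick Foxes: the query requires quick to be immediately followed by a term that starts with fo, so word order matters and Foxes Quick is excluded.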