Elasticsearch - multi-word synonyms not matching

Asked: 2016-02-29 23:04:06

Tags: elasticsearch search-multiple-words

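I am trying to implement a synonym filter to make the search engine in my project more accurate.

I have indexed documents containing the phrase "contrat à durée déterminée". What I want is that when I search for the acronym "cdd", every document containing the exact phrase "contrat à durée déterminée" is matched.

Here are my index settings: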

'analysis' => array(
    'analyzer' => array(
        'indexAnalyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'nGram',
            'filter' => array('asciifolding', 'lowercase', 'synonym', 'snowball', 'elision', 'worddelimiter', 'stopwords'),
        ),
        'searchAnalyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => array('asciifolding', 'lowercase', 'elision', 'worddelimiter', 'synonym', 'stopwords'),
        ),
        'rawAnalyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'keyword',
            'filter' => array('lowercase', 'trim'),
        ),
        'exactSearchAnalyzer' => array(
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => array('elision'),
        ),
    ),
    'tokenizer' => array(
        'nGram' => array(
            'type' => 'nGram',
            'min_gram' => 3,
            'max_gram' => 20,
            'token_chars' => array('letter', 'digit'),
        ),
    ),
    'filter' => array(
        'snowball' => array(
            'type' => 'snowball',
            'language' => 'French',
        ),
        'elision' => array(
            'type' => 'elision',
            'articles' => array('l', 'm', 't', 'qu', 'n', 's', 'j', 'd'),
        ),
        'stopwords' => array(
            'type' => 'stop',
            'stopwords' => array('_french_'),
            'ignore_case' => true,
        ),
        'worddelimiter' => array(
            'type' => 'word_delimiter',
        ),
        'synonym' => array(
            'tokenizer' => 'keyword',
            'type' => 'synonym',
            'synonyms_path' => sfConfig::get('app_elasticsearch_path_synonym'),
            'ignore_case' => true,
        ),
    ),
),

The synonyms file contains a single line: "CDD, Contrat à Durée Déterminée".

Here is part of my index mapping:
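To make the effect of that line concrete, here is a minimal Python sketch of how a comma-separated synonym line could be split when the synonym filter uses the `keyword` tokenizer, as in the settings above. The function name `parse_synonym_line` is hypothetical and this is not Elasticsearch's actual parser; it only assumes that each comma-separated entry is kept whole and lowercased because `ignore_case` is true.

```python
def parse_synonym_line(line, ignore_case=True):
    """Split one comma-separated synonym line into its equivalent entries.

    Assumption: with a keyword tokenizer, each entry stays ONE token,
    spaces included, instead of being split into separate words.
    """
    entries = [entry.strip() for entry in line.split(",")]
    if ignore_case:
        entries = [entry.lower() for entry in entries]
    return [entry for entry in entries if entry]


group = parse_synonym_line("CDD, Contrat à Durée Déterminée")
print(group)
```

Under these assumptions the group holds two entries, and the second one is a single four-word token rather than four separate words.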

{
   "clic": {
      "mappings": {
         "idea": {
            "properties": {
               "expected_benefits": {
                  "properties": {
                     "search": {
                        "type": "string",
                        "analyzer": "searchAnalyzer",
                        "include_in_all": true
                     }
                     ...
                  }
               },
               "initial_situation": {
                  "properties": {
                     "search": {
                        "type": "string",
                        "analyzer": "searchAnalyzer",
                        "include_in_all": true
                     }
                     ...
                  }
               },
               "proposed_solution": {
                  "properties": {
                     "search": {
                        "type": "string",
                        "analyzer": "searchAnalyzer",
                        "include_in_all": true
                     }
                     ...
                  }
               },
               "title": {
                  "properties": {
                     "name": {
                        "type": "string",
                        "analyzer": "searchAnalyzer",
                        "include_in_all": true
                     },
                     ...
                  }
               }
            }
         }
      }
   }
}

Sample document:

{
   "took": 6,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.030160192,
      "hits": [
         {
            "_index": "clic",
            "_type": "idea",
            "_id": "3863",
            "_score": 0.030160192,
            "_source": {
               "id": "3863",
               "title": {
                  "name": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
                  ...
               },
               "initial_situation": {
                  "search": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
                  tempor incididunt ut labore et dolore magna aliqua.",
                  ...
               },
               "proposed_solution": {
                  "search": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
                  tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
                  quis nostrud contrat à durée déterminée, Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
                  tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim",
                  ...
               },
               "expected_benefits": {
                  "search": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
                  tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
                  quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
                  consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
                  cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
                  proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\r\n",
                  ...
               },
               ...
            }
         }
      ]
   }
}

When I use the analyze API like this: GET /clic/_analyze?analyzer=searchAnalyzer&text=cdd

it outputs the correct synonyms:

{
   "tokens": [
      {
         "token": "cdd",
         "start_offset": 0,
         "end_offset": 3,
         "type": "SYNONYM",
         "position": 1
      },
      {
         "token": "contrat à durée déterminée",
         "start_offset": 0,
         "end_offset": 3,
         "type": "SYNONYM",
         "position": 1
      }
   ]
}
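One detail in this output may be worth checking: the second synonym comes back as a single token that contains spaces. The Python sketch below flags such tokens in an analyze response. The helper name `multi_word_tokens` is hypothetical, and the response is a trimmed copy of the output above, not a live call to the cluster.

```python
import json

# Trimmed copy of the _analyze output shown above (offsets omitted).
analyze_response = json.loads("""
{"tokens": [
  {"token": "cdd", "type": "SYNONYM", "position": 1},
  {"token": "contrat à durée déterminée", "type": "SYNONYM", "position": 1}
]}
""")


def multi_word_tokens(response):
    """Return every token whose text contains a space."""
    return [t["token"] for t in response["tokens"] if " " in t["token"]]


flagged = multi_word_tokens(analyze_response)
print(flagged)
```

A token flagged this way can only match a field whose indexed terms include the whole phrase as one term, which a field analyzed with the `standard` tokenizer would not normally produce.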

So far, this looks right to me. Also, when I use the validate API to explain my query:

GET clic/idea/_validate/query?explain
{
   "query": {
      "filtered": {
         "query": {
            "bool": {
               "should": [
                  {
                     "multi_match": {
                        "query": "cdd",
                        "type": "cross_fields",
                        "fields": [
                           "title.name^3",
                           "initial_situation.search^3",
                           "proposed_solution.search^3",
                           "expected_benefits.search^3"
                        ],
                        "operator": "and",
                        "analyzer": "searchAnalyzer"
                     }
                  }
               ]
            }
         }
      }
   }
}

Output:

{
   "valid": true,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "explanations": [
      {
            "index": "clic",
            "valid": true,
            "explanation": "filtered((
                blended(terms: [proposed_solution.search:cdd, title.name:cdd, expected_benefits.search:cdd, initial_situation.search:cdd]) 
                blended(terms: [proposed_solution.search:contrat à durée déterminée, title.name:contrat à durée déterminée, expected_benefits.search:contrat à durée déterminée, initial_situation.search:contrat à durée déterminée])
            ))
            ->cache(_type:idea)"
      }
   ]
}

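As I understand it, ES searches for "cdd" and "contrat à durée déterminée" in all the fields I listed, so it should find documents containing either "cdd" or "contrat à durée déterminée". But it does not: when I run the same query as an actual search (POST), it returns 0 hits.

I hope my explanation is clear. Any help would be greatly appreciated :) Thanks!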
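One possible reading of the 0 hits, offered as a hypothesis rather than a confirmed diagnosis, is that the query looks up "contrat à durée déterminée" as one term while the indexed field only holds the individual words. The toy Python model below illustrates this. `simple_analyze` is a rough stand-in for the `standard` tokenizer plus `lowercase` and `asciifolding`, and it deliberately ignores the stop, elision and word_delimiter filters.

```python
import re
import unicodedata


def ascii_fold(text):
    """Strip accents, roughly like the asciifolding filter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_analyze(text):
    """Very rough stand-in: split on non-word characters, lowercase, fold."""
    return [ascii_fold(tok.lower()) for tok in re.findall(r"\w+", text)]


# Excerpt of the sample document's proposed_solution.search field.
indexed_terms = set(simple_analyze("quis nostrud contrat à durée déterminée, Lorem ipsum"))

# The two terms reported by the validate API for the query "cdd".
query_terms = ["cdd", "contrat à durée déterminée"]

matches = [term for term in query_terms if term in indexed_terms]
print(sorted(indexed_terms))
print(matches)
```

In this model the index contains "contrat", "a", "duree" and "determinee" as separate terms, so neither query term is found. Whether the real index behaves the same way could be checked by running the analyze API against the document text instead of against "cdd".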
