Filter Ngram by score

Time: 2019-04-19 22:37:57

Tags: elasticsearch

My search string is "Resta", and the current search results include:

"Save at any restaurant!",
"Save at any gas station!"

The reason is due to my index:

{
  "rewards": {
    "aliases": {},
    "mappings": {
      "_doc": {
        "properties": {
          "name": {
            "type": "text",
            "fields": {
              "name": { "type": "text", "analyzer": "ngram_analyzer" }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "rewards",
        "creation_date": "1555542654894",
        "analysis": {
          "filter": {
            "ngram_filter": { "type": "ngram", "min_gram": "2", "max_gram": "20" }
          },
          "analyzer": {
            "ngram_analyzer": {
              "filter": [ "lowercase", "ngram_filter" ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "Nzf6KNHkQIeKP0HbVFK1lw",
        "version": { "created": "6060299" }
      }
    }
  }
}

When I inspect the term vectors of the document "Save at any gas station!", I can confirm that "sta" is indexed as an ngram:

{
  "_index": "rewards",
  "_type": "_doc",
  "_id": "6",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "name": {
      "field_statistics": {
        "sum_doc_freq": 73,
        "doc_count": 3,
        "sum_ttf": 73
      },
      "terms": {
        "any": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 8,
              "end_offset": 11
            }
          ]
        },
        "save": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 4
            }
          ]
        },
        "sta": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "start_offset": 16,
              "end_offset": 23
            }
          ]
        }
      }
    }
  }
}

(I have left out a lot of other terms for brevity)

The query used:

{
  "bool": {
    "should": [
      {
        "multi_match": {
          "query": "restaurant",
          "fields": [
            "name",
            "category"
          ],
          "operator": "and"
        }
      }
    ]
  }
}

When searching I get these scores:

["Save at any restaurant!", 1.1967528]
["Save at any gas station!", 0.7141209]

The user here is really looking for "Restaurant". I am wondering how I can filter or exclude results by score. I can't seem to find a good definition of the score (it seems to be relative), but how can I (ultimately) not show "Save at any gas station!"?

Even giving it the full search term "restaurant", the score is only somewhat better.

1 Answer:

Answer 0 (score: 1)

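You can simply create an Edge-Ngram analyzer in your mapping and use only that analyzer in your search request.

What an edge ngram does is create tokens using only the leading characters of a word.

For example: re, res, rest, resta, restau, restaur, restaura, restauran, restaurant

I've added an edge n-gram analyzer to the settings; note that I do not assign it to any field. I only use it at search time, in the query.

This means that only the above tokens for "restaurant" are looked up in the inverted index.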
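To make the difference concrete, here is a minimal Python sketch that imitates the two token filters on single, already lowercased words. The helper names `ngrams` and `edge_ngrams` are made up for illustration and only approximate what Elasticsearch does; the sketch also assumes the search hits a field that was indexed with `ngram_analyzer`.

```python
def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of length min_gram..max_gram (roughly the `ngram` filter)."""
    token = token.lower()
    return {
        token[i:i + n]
        for n in range(min_gram, min(max_gram, len(token)) + 1)
        for i in range(len(token) - n + 1)
    }


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Only prefixes of length min_gram..max_gram (roughly the `edge_ngram` filter)."""
    token = token.lower()
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


# Grams stored in the index for two of the indexed words.
indexed_station = ngrams("station")
indexed_restaurant = ngrams("restaurant")

# Query analysed with plain n-grams: "resta" shares grams with "station".
print(sorted(ngrams("resta") & indexed_station))  # -> ['st', 'sta', 'ta']

# Query analysed with edge n-grams: only prefixes of the query are looked up.
print(sorted(edge_ngrams("restaurant") & indexed_station))  # -> []
print(sorted(edge_ngrams("restaurant") & indexed_restaurant))
```

The shared grams in the first print are what pull "Save at any gas station!" into the results; with prefix-only tokens at search time there is nothing left for "station" to match.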

Below is a sample mapping and the corresponding query.

Mapping

PUT <your_index_name>
{  
   "mappings":{  
      "mydocs":{  
         "properties":{  
            "name":{  
               "type":"text",
               "fields":{  
                  "name":{  
                     "type":"text",
                     "analyzer":"ngram_analyzer"
                  }
               }
            }
         }
      }
   },
   "settings":{  
      "index":{  
         "number_of_shards":"5",
         "analysis":{  
            "filter":{  
               "ngram_filter":{  
                  "type":"ngram",
                  "min_gram":"2",
                  "max_gram":"20"
               },
               "edgengram_filter":{  
                  "type":"edge_ngram",
                  "min_gram":"2",
                  "max_gram":"20"
               }
            },
            "analyzer":{  
               "ngram_analyzer":{  
                  "filter":[  
                     "lowercase",
                     "ngram_filter"
                  ],
                  "type":"custom",
                  "tokenizer":"standard"
               },
               "edgengram_analyzer":{  
                  "filter":[  
                     "lowercase",
                     "edgengram_filter"
                  ],
                  "type":"custom",
                  "tokenizer":"standard"
               }
            }
         },
         "number_of_replicas":"1"
      }
   }
}

Below is how the query would look:

Query

POST <your_index_name>/_search
{  
   "query":{  
      "bool":{  
         "should":[  
            {  
               "multi_match":{  
                  "query":"restaurant",
                  "fields":[  
                     "name",
                     "category"
                  ],
                  "operator":"and",
                  "analyzer":"edgengram_analyzer"   <---- Added this
               }
            }
         ]
      }
   }
}
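If the request is built from application code, the same body can be assembled as a plain dictionary. The sketch below is only an illustration: `build_search_body` is a hypothetical helper, not part of any Elasticsearch client. It also shows where Elasticsearch's top-level `min_score` option would go if you still want a score cut-off; any threshold value is an assumption you would have to tune, since scores are relative to the query and the index.

```python
import json


def build_search_body(term, fields, search_analyzer, min_score=None):
    """Build a bool/should multi_match body analysed with `search_analyzer`."""
    body = {
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": term,
                            "fields": list(fields),
                            "operator": "and",
                            # Search-time analyzer, as in the query above.
                            "analyzer": search_analyzer,
                        }
                    }
                ]
            }
        }
    }
    if min_score is not None:
        # Optional: the top-level `min_score` drops hits scoring below it.
        body["min_score"] = min_score
    return body


body = build_search_body("restaurant", ["name", "category"], "edgengram_analyzer")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST <your_index_name>/_search.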

With this in place, you should see the desired results.

Hope it helps.