Question

I am trying to find duplicates in an index by using a script to aggregate users on [array] + field.

My question is why the terms aggregation counts each document only once for a given key (smith@gmail.com_SMITH), and whether this behavior can be changed.

Data:

POST users/user
{
    "name" :"SMITH",
    "emails" : [
       "smith@gmail.com"
    ]
}

POST users/user
{
    "name" :"SMITH",
    "emails" : [
      "mrsmith@gmail.com",
      "smith@gmail.com"
    ]
}

Query:

POST users/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "script": {
          "inline": "doc['emails.keyword'].value + '_' + doc['name.keyword'].value"
        }
      }
    }
  }
}

Result:

"aggregations": {
  "duplicateCount": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "mrsmith@gmail.com_SMITH",
        "doc_count": 1
      },
      {
        "key": "smith@gmail.com_SMITH",
        "doc_count": 1
      }
    ]
  }
}

Answer 1

It seems you only get the correct terms aggregation counts with "terms" + "field".

If you try this query, you can see the difference between "terms" + "field" and "terms" + "script":

{
  "from" : 0,
  "size" : 0,
  "_source" : true,
  "query" : {
    "bool" : {
      "must" : [ {
        "match" : {
          "name" : {
            "query" : "SMITH",
            "operator" : "OR",
            "fuzziness" : "AUTO",
            "prefix_length" : 1,
            "max_expansions" : 50,
            "fuzzy_transpositions" : true,
            "lenient" : false,
            "zero_terms_query" : "NONE",
            "boost" : 1
          }
        }
      } ]
    }
  },
  "aggs": {
    "duplicateCount": {
      "terms": {
        "script": {
          "inline": "doc['emails.keyword'].value + '_' + doc['name.keyword'].value"
        }
      }
    },
    "duplicateCount2": {
      "terms": {
        "field": "emails.keyword"
      }
    }
  }
}

Here are the results. See duplicateCount2:

{
  "took" : 53,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "duplicateCount2" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "smith@gmail.com",
        "doc_count" : 2
      }, {
        "key" : "mrsmith@gmail.com",
        "doc_count" : 1
      } ]
    },
    "duplicateCount" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "mrsmith@gmail.com_SMITH",
        "doc_count" : 1
      }, {
        "key" : "smith@gmail.com_SMITH",
        "doc_count" : 1
      } ]
    }
  }
}
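
A likely explanation for the difference: in the script, `doc['emails.keyword'].value` returns only a single value per document (the first one in doc-values order), while "terms" + "field" counts every value of the array. The following is a minimal Python sketch of that behavior, not Elasticsearch code. It assumes keyword doc values are deduplicated and sorted, and the helper names (`doc_values`, `script_buckets`, `field_buckets`) are made up for illustration.

```python
from collections import Counter

# The two documents from the question.
docs = [
    {"name": "SMITH", "emails": ["smith@gmail.com"]},
    {"name": "SMITH", "emails": ["mrsmith@gmail.com", "smith@gmail.com"]},
]


def doc_values(values):
    """Model of keyword doc values (assumption): deduplicated and sorted."""
    return sorted(set(values))


def script_buckets(docs):
    """Model of "terms" + "script" with `.value`: one key per document."""
    counts = Counter()
    for d in docs:
        first_email = doc_values(d["emails"])[0]  # the only value `.value` exposes
        counts[first_email + "_" + d["name"]] += 1
    return dict(counts)


def field_buckets(docs):
    """Model of "terms" + "field": every value in the array is counted."""
    counts = Counter()
    for d in docs:
        counts.update(doc_values(d["emails"]))
    return dict(counts)


print(script_buckets(docs))
print(field_buckets(docs))
```

Under this model the second document contributes only the key built from mrsmith@gmail.com, which matches the duplicateCount buckets above, while the field-based count sees smith@gmail.com in both documents.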

Answer 2

OK. I solved it by iterating over the array of terms and building the required keys manually:

def keys = [];
for (p in doc['emails.keyword'].values) {
    // Build one key per email so every value in the array is counted.
    keys.add(p + '_' + doc['name.keyword'].value);
}
return keys;

The result is as follows:

 "buckets": [
    {
      "key": "smith@gmail.com_SMITH",
      "doc_count": 2
    },
    {
      "key": "mrsmith@gmail.com_SMITH",
      "doc_count": 1
    }
  ]
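
Since the goal is to find duplicates, the buckets still need to be filtered down to keys that occur in more than one document. Here is a minimal client-side sketch in Python. The `duplicate_keys` helper is hypothetical, and the `buckets` list is just the result above copied by hand.

```python
def duplicate_keys(buckets, min_count=2):
    """Return the keys of buckets seen in at least `min_count` documents."""
    return [b["key"] for b in buckets if b["doc_count"] >= min_count]


# Buckets copied from the aggregation result above.
buckets = [
    {"key": "smith@gmail.com_SMITH", "doc_count": 2},
    {"key": "mrsmith@gmail.com_SMITH", "doc_count": 1},
]

print(duplicate_keys(buckets))
```

The same filtering can probably be done on the server by setting the terms aggregation's `min_doc_count` parameter to 2, so that only duplicate keys are returned; I have not tested that against this index.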

Wrong doc_count in Elasticsearch terms aggregation
