Elasticsearch: doc_count mismatch in deep aggregations

Date: 2016-02-25 23:36:41

Tags: elasticsearch

I am running a few aggregations on an ES 1.7.2 installation to roll up some values.

I found out the hard way that, in some seemingly random cases, an aggregation bucket's doc_count does not match the SUM of the doc_count values of its nested buckets.

"key": 503,
"doc_count": 383778,
"regionid": {...}

So doc_count = 383778.

If I sum the doc_count of every element of regionid in the list below, I get doc_count = 383718:

 "key": 503,
 "doc_count": 383778,
 "regionid": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
       {
          "key": 1,
          "doc_count": 303821,
          "ProviderId": {...}
       },
       {
          "key": 27,
          "doc_count": 23834,
          "ProviderId": {...}
       },
       {
          "key": 25,
          "doc_count": 9565,
          "ProviderId": {...}
       },
       {
          "key": 36,
          "doc_count": 8857,
          "ProviderId": {...}
       },
       {
          "key": 14,
          "doc_count": 8222,
          "ProviderId": {...}
       },
       {
          "key": 68,
          "doc_count": 6746,
          "ProviderId": {...}
       },
       {
          "key": 19,
          "doc_count": 4574,
          "ProviderId": {...}
       },
       {
          "key": 28,
          "doc_count": 4164,
          "ProviderId": {...}
       },
       {
          "key": 10,
          "doc_count": 3006,
          "ProviderId": {...}
       },
       {
          "key": 31,
          "doc_count": 2020,
          "ProviderId": {...}
       },
       {
          "key": 21,
          "doc_count": 1410,
          "ProviderId": {...}
       },
       {
          "key": 32,
          "doc_count": 1368,
          "ProviderId": {...}
       },
       {
          "key": 22,
          "doc_count": 1367,
          "ProviderId": {...}
       },
       {
          "key": 8,
          "doc_count": 1010,
          "ProviderId": {...}
       },
       {
          "key": 16,
          "doc_count": 825,
          "ProviderId": {...}
       },
       {
          "key": 35,
          "doc_count": 559,
          "ProviderId": {...}
       },
       {
          "key": 34,
          "doc_count": 517,
          "ProviderId": {...}
       },
       {
          "key": 26,
          "doc_count": 414,
          "ProviderId": {...}
       },
       {
          "key": 18,
          "doc_count": 371,
          "ProviderId": {...}
       },
       {
          "key": 15,
          "doc_count": 362,
          "ProviderId": {...}
       },
       {
          "key": 33,
          "doc_count": 185,
          "ProviderId": {...}
       },
       {
          "key": 9,
          "doc_count": 143,
          "ProviderId": {...}
       },
       {
          "key": 29,
          "doc_count": 102,
          "ProviderId": {...}
       },
       {
          "key": 17,
          "doc_count": 100,
          "ProviderId": {...}
       },
       {
          "key": 30,
          "doc_count": 96,
          "ProviderId": {...}
       },
       {
          "key": 20,
          "doc_count": 80,
          "ProviderId": {...}
       }
    ]
 }
},
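
Adding up all 26 regionid buckets by hand (303821 + 23834 + 9565 + … + 96 + 80) indeed gives 383718, i.e. 60 documents fewer than the parent bucket's doc_count of 383778, even though sum_other_doc_count is 0.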

Do you have any idea why this happens?

Maybe a bug?

Part of my aggregation:

 {
    "aggs": {
       "Provider": {
          "terms": {
             "field": "Provider"
          },
          "aggs": {
             "Gateway": {
                "terms": {
                   "field": "Gateway"
                },
                "aggs": {
                   "CustomerId": {
                      "terms": {
                         "field": "CustomerId"
                      },
                      "aggs": {
                         "regionid": {
                            "terms": {
                               "field": "regionid"
                            }
                         }
                      }
                   }
                }
             }
          }
       }
    }
 }

Any help is appreciated. Thanks.

1 answer:

Answer 0 (score: 1):

Aggregations in ES are not exact; they are estimates based on counting only a sample of the records (each shard contributes only its top buckets to the final result). If the sample is large enough the numbers can be accurate, but that comes with a significant performance impact.
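
As a quick sanity check, the terms aggregation can report how large that error might be per bucket. A minimal sketch against the question's regionid field (assuming ES 1.4+, where the show_term_doc_count_error option was introduced):

 {
    "aggs": {
       "regionid": {
          "terms": {
             "field": "regionid",
             "show_term_doc_count_error": true
          }
       }
    }
 }

Each returned bucket then carries its own doc_count_error_upper_bound, i.e. the worst-case number of documents that could be missing from its doc_count after the per-shard top lists are merged.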

You can read more about "Shard Size" in the ES documentation on shard_size for the terms aggregation.

The flatter your index is (meaning the more buckets your aggregations return), the more you need to increase the shard size. We have found that a 20x multiplier is a good rule of thumb for the flat indexes in our system: if an aggregation returns the top 10 buckets, we use a shard_size of 200.
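
Applied to the regionid level of the question's aggregation, that rule of thumb would look like the sketch below (the size of 10 and shard_size of 200 simply illustrate the 20x multiplier; tune size to however many buckets you actually need):

 {
    "aggs": {
       "regionid": {
          "terms": {
             "field": "regionid",
             "size": 10,
             "shard_size": 200
          }
       }
    }
 }

With this, each shard keeps its top 200 regionid terms before the final top 10 are merged, which makes the merged doc_count values much less likely to miss documents. As an aside, on ES 1.x a terms aggregation with "size": 0 returns every bucket, which removes the approximation entirely at the maximum performance cost.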