Filter a terms aggregation by key length

Asked: 2014-10-31 13:17:16

Tags: elasticsearch aggregation

I have an ES index that contains parameter data from some scientific experiments.

I have the following terms aggregation:

{
    "aggs": {
        "variables": {
            "terms": {
                "field": "value",
                "size": 100
            }
        }
    },
    "size": 0
}

It returns the following result:

{
    "took" : 3,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits" : {
        "total" : 9928,
        "max_score" : 0.0,
        "hits" : [ ]
    },
    "aggregations" : {
        "variables" : {
            "buckets" : [ {
                "key" : "00",
                "doc_count" : 158
            }, {
                "key" : "1",
                "doc_count" : 158
            }, {
                "key" : "2",
                "doc_count" : 158
            }, {
                "key" : "pressure",
                "doc_count" : 158
            }, {
                "key" : "seconds",
                "doc_count" : 158
            }, {
                "key" : "since",
                "doc_count" : 158
            }, {
                "key" : "s",
                "doc_count" : 156
            }, {
                "key" : "speed",
                "doc_count" : 127
            }, {
                "key" : "sample",
                "doc_count" : 121
            }, {
                "key" : "a",
                "doc_count" : 104
            } ]
        }
    }
}

What I want to do is tell ElasticSearch to ignore all keys shorter than 5 characters;

e.g. filter out "key": "a", "key": "s", and so on.

Is this possible?

2 answers:

Answer 0 (score: 1)

I think you should use a Regexp Filter to get the result you want:

    "filter": {
        "regexp":{
            "value" : ".{2,}"
        }
    }
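
As a related sketch, not part of the answer above: a query-level filter selects documents, so short terms belonging to matching documents may still show up as buckets. The terms aggregation also accepts an `include` parameter with a regular expression, which restricts the bucket keys themselves. The Python below only builds such a request body and uses `re.fullmatch` to approximate which of the keys from the question would survive a `.{5,}` pattern (Lucene regular expressions match the whole term). The helper names `build_terms_request` and `keys_kept` are hypothetical, and the behaviour should be checked against the Elasticsearch version in use.

```python
import json
import re

MIN_KEY_LENGTH = 5  # assumption: threshold taken from the question


def build_terms_request(field, min_length, size=100):
    """Build a terms aggregation body whose buckets are limited to
    keys of at least `min_length` characters via `include`."""
    return {
        "aggs": {
            "variables": {
                "terms": {
                    "field": field,
                    "size": size,
                    # Regular expression applied to each candidate bucket key.
                    "include": ".{%d,}" % min_length,
                }
            }
        },
        "size": 0,
    }


def keys_kept(keys, min_length):
    """Approximate the whole-term regex match locally with re.fullmatch."""
    pattern = re.compile(".{%d,}" % min_length)
    return [key for key in keys if pattern.fullmatch(key)]


request_body = build_terms_request("value", MIN_KEY_LENGTH)

# Bucket keys copied from the response shown in the question.
sample_keys = ["00", "1", "2", "pressure", "seconds",
               "since", "s", "speed", "sample", "a"]
kept = keys_kept(sample_keys, MIN_KEY_LENGTH)

print(json.dumps(request_body, indent=4))
print(kept)
```

The printed JSON can be sent as the search body in place of the original aggregation; the local check is only a rough preview of which keys the pattern lets through.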

Answer 1 (score: 1)

OK, so I solved this. I re-indexed the data using a custom analyzer, defined as follows:

PUT $host/$index

{
    "settings": {
        "analysis": {
            "filter": {
                "min_length_5_filter": {
                    "type": "length",
                    "min": 5,
                    "max": 256
                }
            },
            "analyzer": {
                "variable_name_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["min_length_5_filter"]
                }
            }
        }
    }
}

Then, in the index mapping:

PUT $host/$index/_mapping/$mapping_name

...
"parameters": {
    "properties": {
        "name": {
            "type": "string",
            "analyzer": "variable_name_analyzer"
        },
        "value": {
            "type": "string",
            "analyzer": "variable_name_analyzer"
        }
    }
},
...

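With the setup above, filtering the tokenized strings by a minimum length let me remove a lot of junk values, and the terms aggregation now works well. Hope this helps someone!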
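
To see roughly what the analyzer in this answer does to a value before it is indexed, here is a minimal Python approximation. It assumes the `lowercase` tokenizer splits text on non-letter characters and lowercases each token, and that the `length` filter keeps tokens whose length lies between `min` and `max` inclusive. The function names and the sample strings are hypothetical; real Elasticsearch analysis may differ in edge cases.

```python
import re


def lowercase_tokenize(text):
    """Approximate the `lowercase` tokenizer: split on anything that is
    not a letter and lowercase each resulting token."""
    # [^\W\d_] matches word characters except digits and underscore,
    # i.e. letters only.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def length_filter(tokens, min_len=5, max_len=256):
    """Approximate the `length` token filter with min=5 and max=256."""
    return [token for token in tokens if min_len <= len(token) <= max_len]


def analyze(text):
    """Chain the tokenizer and the filter, like variable_name_analyzer."""
    return length_filter(lowercase_tokenize(text))


# Hypothetical field values, chosen to resemble the keys in the question.
print(analyze("Pressure since 00 s"))
print(analyze("Speed of sample a, 2 seconds"))
```

Only the tokens that survive this chain are indexed, which is why the short keys no longer appear as buckets after re-indexing.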