SQL-style GROUP BY aggregate functions in jq (COUNT, SUM, etc.)

Question

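Similar questions asked before:

Count items for a single key: jq count the number of items in json by a specific key

Sum the values of objects: How do I sum the values in an array of maps in jq?

Problem

How can the COUNT aggregate function be emulated so that it behaves like its SQL original? Let's extend the question to cover other common SQL aggregate functions as well: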

COUNT
SUM / MAX / MIN / AVG
ARRAY_AGG

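The last one is not a standard SQL function; it comes from PostgreSQL, but it is very useful.

The input is a stream of valid JSON objects. For demonstration, let's take a simple story about owners and their pets.

Model and data

Base relation: Owner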

id name  age
 1 Adams  25
 2 Baker  55
 3 Clark  40
 4 Davis  31

Base relation: Pet

id name  litter owner_id
10 Bella      4        1
20 Lucy       2        1
30 Daisy      3        2
40 Molly      4        3
50 Lola       2        4
60 Sadie      4        4
70 Luna       3        4

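Source

From the above we get a derived relation Owner_Pet (the result of a SQL JOIN of the two base relations), which is presented to our jq queries as a stream of JSON objects (the source data):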

{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }

Requests

Here are sample requests and their expected output:

Count the number of pets per owner:

{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }

Sum up the number of litters for each owner and get their MAX (MIN / AVG):

{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }

ARRAY_AGG pets per owner:

{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }

Answer 1

Extended jq solution:

Custom count() function:

jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0] 
                       | .pets_count = $l 
                       | del(.pet_id, .pet, .litter); 
        count("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Custom sum() function:

jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0] 
                     | . + {litter_total: $litters | add, litter_max: $litters | max} 
                     | del(.pet_id, .pet, .litter); 
        sum("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}

Custom array_agg() function:

jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0] 
                           | .pets = $pets | del(.pet_id, .pet, .litter); 
        array_agg("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}

Answer 2

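This is a nice exercise, but SO is not a programming service, so I'll focus on some key concepts for generic solutions in jq that remain efficient even for very large collections.

GROUPS_BY

The key to efficiency here is avoiding the built-in group_by, because it requires a sort. Since jq is fundamentally stream-oriented, the following definition of GROUPS_BY is likewise stream-oriented. It takes advantage of the efficiency of key-based lookups, while avoiding the call to tojson on strings: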

# emit a stream of the groups defined by f
def GROUPS_BY(stream; f): 
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;
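
The grouping state is keyed first by the JSON type of the key and then by its string form, so a number and a string that print alike do not collide. A minimal sketch of that behavior follows; the three-record sample and the shell variable names are hypothetical, and the definition is restated only so the snippet runs on its own.

```shell
# GROUPS_BY as defined above, restated so the sketch is self-contained.
defs='
def GROUPS_BY(stream; f):
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;
'

# Hypothetical sample: the string "1" and the number 1 are different keys.
sample='{"k":"1","v":"a"} {"k":1,"v":"b"} {"k":"1","v":"c"}'

# -n plus `inputs` lets the reduce consume the records one at a time.
groups=$(printf '%s\n' "$sample" | jq -nc "$defs"' GROUPS_BY(inputs; .k) | map(.v)')
printf '%s\n' "$groups"
```

Two groups are emitted: one holding "a" and "c" (string key), one holding "b" (number key). Within a group, items keep their arrival order because each one is appended to the end of its bucket.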

`distinct` and `count_distinct`

# Emit an array of the distinct entities in `stream`, without sorting
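def distinct(stream): 
  reduce stream as $x ({};
      ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
      # plain lookup rather than has($y): .[$t] is null the first time a type is seen, and has/1 raises an error on null
      | if .[$t][$y] then . else .[$t][$y] = [$x] end )
   | [.[][]] | add ;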


# Emit the number of distinct items in the given stream
def count_distinct(stream):
   def sum(s): reduce s as $x (0;.+$x);
   reduce stream as $x ({};
       ($x|type) as $t
       | (if $t == "string" then $x else ($x|tojson) end) as $y
       | .[$t][$y] = 1 )
   | sum( .[][] ) ;
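
A short usage sketch for the two helpers, under a few assumptions: the argument streams are made-up literals, the shell variable names are illustrative, and the definitions are restated (with the membership test in distinct written as a plain lookup) so that the snippet runs on its own.

```shell
defs='
def distinct(stream):
  reduce stream as $x ({};
      ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
      | if .[$t][$y] then . else .[$t][$y] = [$x] end )
   | [.[][]] | add ;

def count_distinct(stream):
   def sum(s): reduce s as $x (0; . + $x);
   reduce stream as $x ({};
       ($x|type) as $t
       | (if $t == "string" then $x else ($x|tojson) end) as $y
       | .[$t][$y] = 1 )
   | sum( .[][] ) ;
'

# Duplicates are dropped; no sort is involved.
pets=$(jq -nc "$defs"' distinct("Lola", "Sadie", "Lola", "Luna")')

# 1 (number), "1" (string) and null are three distinct values.
n=$(jq -nc "$defs"' count_distinct(1, "1", 1, null, "1")')

printf '%s\n%s\n' "$pets" "$n"
```

The second call returns 3: the type-level key keeps the number 1 apart from the string "1", and null is stored under its own type.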

Convenience function

def owner: {owner_id,owner,age};

Example: "Count the number of pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}

Invocation: jq -nc -f program1.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Example: "Sum up the number of litters for each owner and get their MAX"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
  + {litter_total: (map(.litter) | add)}
  + {litter_max:  (map(.litter) | max)}

Invocation: jq -nc -f program2.jq input.json

Output: as given in the question.
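
The same pattern appears to extend directly to the MIN and AVG mentioned in the question. The following end-to-end sketch feeds the seven source records on standard input; the field names litter_min and litter_avg and the shell variable names are illustrative choices, not part of the question.

```shell
defs='
def GROUPS_BY(stream; f):
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;

def owner: {owner_id, owner, age};
'

source_data='
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }
'

# Collect the litter values of each group once, then derive every aggregate from them.
result=$(printf '%s\n' "$source_data" | jq -nc "$defs"'
  GROUPS_BY(inputs; .owner_id)
  | map(.litter) as $l
  | (.[0] | owner)
    + {litter_total: ($l | add),
       litter_max:   ($l | max),
       litter_min:   ($l | min),
       litter_avg:   (($l | add) / ($l | length))}')
printf '%s\n' "$result"
```

One object is emitted per owner. For Davis (owner_id 4) the litter values are 2, 4 and 3, giving a total of 9, a maximum of 4, a minimum of 2 and an average of 3.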

Example: "ARRAY_AGG pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}

Invocation: jq -nc -f program3.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}

Answer 3

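Here is an alternative that uses only basic jq, without any custom functions. (I took the liberty of dropping the superfluous parts of the question.)

Count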

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, count: map(.pet) | length})'
Out> [{"owner_id": 1,"count": 2}, ...]

Sum

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, sum: map(.litter) | add})'
Out> [{"owner_id": 1,"sum": 6}, ...]

Max

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, max: map(.litter) | max})'
Out> [{"owner_id": 1,"max": 4}, ...]

Aggregate

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, agg: map(.pet)})'
Out> [{"owner_id": 1,"agg": ["Bella","Lucy"]}, ...]

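Of course, these may not be the most efficient implementations, but they show nicely how to implement such functions yourself. All that changes between the different functions is inside the last map: the field being collected and the function after the pipe | (length, add, max).

The first map iterates over the groups and takes the owner_id from the first item of each group; the inner map then iterates over the items of that same group. Not as pretty as SQL, but not much more complicated either.

I learned jq today and already managed to do this, so it should be encouraging for anyone getting started. jq is neither like sed nor like SQL, but it is not terribly hard either.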
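
Since only the expression inside the last map changes between these queries, they can presumably be combined into a single pass over the groups. A sketch under that assumption, which also adds the MIN and AVG from the question; the output field names min and avg and the shell variable names are illustrative.

```shell
source_data='
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }
'

# -s slurps the stream into one array; group_by then sorts it by owner_id.
result=$(printf '%s\n' "$source_data" | jq -sc '
  group_by(.owner_id)
  | map({
      owner_id: .[0].owner_id,
      count: length,                           # COUNT(*)
      sum:   (map(.litter) | add),             # SUM(litter)
      max:   (map(.litter) | max),             # MAX(litter)
      min:   (map(.litter) | min),             # MIN(litter)
      avg:   ((map(.litter) | add) / length),  # AVG(litter)
      agg:   map(.pet)                         # ARRAY_AGG(pet)
    })')
printf '%s\n' "$result"
```

The result is a single array with one object per owner. Inside the outer map, the bare length refers to the current group, so it serves both as the count and as the divisor for the average.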
