Hive: Sum over a specified group (HiveQL)

Time: 2014-08-01 14:03:51

Tags: hadoop hive hiveql hortonworks-data-platform

I have a table:

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

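I want to find the total cost of all products within each "key" group and append it to every row. For example, for key = 1, find the sum of the costs of all products (20 + 10 + 5 = 35) and then append that result to all rows with key = 1. The final result: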

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

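I would prefer to do this without a sub-join, since that would be inefficient. My best idea is to combine the OVER function with the SUM function, but I cannot get it to work. My best attempt: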

SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;

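I have looked at the docs, but they are so cryptic that I have no idea how to solve this. I am using Hive v0.12.0, HDP v2.0.6, the HortonWorks Hadoop distribution.

6 answers:

Answer 0 (score: 9)

Similar to @VB_'s answer, use the BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause.

So the HiveQL query is: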

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;

Answer 1 (score: 4)

You can achieve this without a self-join by using BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

The code is as follows:

SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
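
As a hedged illustration, here is the same frame applied to the `test` table from the question. It assumes the column is named `cost`, as in the table header, and picks `product_code` as the ordering column purely for the example; `running_cost` is a made-up alias. Because the frame ends at CURRENT ROW, each row sees only the rows up to and including itself, so the result is a running total within each key:

```sql
-- Running total per key, ordered by product_code (ordering column chosen only for illustration).
-- For the sample data: key = 1 gives EU 5, UK 25, US 35; key = 2 gives EU 6, UK 9.
SELECT key, product_code, cost,
       SUM(cost) OVER (PARTITION BY key
                       ORDER BY product_code
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_cost
FROM test;
```

Only the last row of each partition carries the full group total here. Ending the frame at UNBOUNDED FOLLOWING instead puts the total on every row.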

Answer 2 (score: 2)

The analytic function sum gives a cumulative sum. For example, if you run:

select key, product_code, cost, sum(cost) over (partition by key) as total_costs from test

then you will get:

key    product_code    cost     total_costs
1      UK              20       20
1      US              10       30
1      EU              5        35
2      UK              3        3
2      EU              6        9

which does not seem to be what you want.

Instead, you should use the aggregate function sum, combined with a self-join, to achieve this:

select test.key, test.product_code, test.cost, agg.total_cost
from (
  select key, sum(cost) as total_cost
  from test
  group by key
) agg
join test
on agg.key = test.key;

Answer 3 (score: 1)

The table above appears to be

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

The user wants a table that includes the total cost, as shown below

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

So we used the following query

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;

So far so good. I want one more column that counts the occurrences of each country

key    product_code    cost     total_costs     occurences
1      UK              20       35              2
1      US              10       35              1
1      EU              5        35              2
2      UK              3        9               2
2      EU              6        9               2

So I used the following query

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as total_costs
COUNT(product code) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as occurences
FROM test;

Sadly, this does not work. I get a cryptic error. To rule out a mistake in my query, I would like to ask whether I am doing something wrong. Thanks
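
Two things in the query above look like possible causes, although the exact error message is not given: there is no comma between the two window expressions, and `product code` is written with a space instead of `product_code`. Counting occurrences per country also suggests partitioning by `product_code` rather than by `key`. A minimal sketch under those assumptions, using the column name `cost` from the table header (whether a given Hive release accepts two different PARTITION BY clauses in one SELECT is worth verifying):

```sql
-- total_costs: sum of cost over all rows sharing the same key.
-- occurrences: number of rows sharing the same product_code.
SELECT key, product_code, cost,
       SUM(cost) OVER (PARTITION BY key)          AS total_costs,
       COUNT(*)  OVER (PARTITION BY product_code) AS occurrences
FROM test;
```

With the sample data this should give 2 for UK and EU and 1 for US, which matches the desired table.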

Answer 4 (score: 1)

A similar answer (using the Oracle emp table):

select deptno, ename, sal, sum(sal) over(partition by deptno) from emp;

The output will look like this:

deptno  ename   sal     sum_window_0
10      MILLER  1300    8750
10      KING    5000    8750
10      CLARK   2450    8750
20      SCOTT   3000    10875
20      FORD    3000    10875
20      ADAMS   1100    10875
20      JONES   2975    10875
20      SMITH   800     10875
30      BLAKE   2850    9400
30      MARTIN  1250    9400
30      ALLEN   1600    9400
30      WARD    1250    9400
30      TURNER  1500    9400
30      JAMES   950     9400

Answer 5 (score: 0)

This query gave me the perfect result

select key, product_code, cost, sum(cost) over (partition by key) as total_costs from zone;
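
This is consistent with the defaults described in Hive's windowing documentation. When the OVER clause has neither ORDER BY nor a frame, the frame covers the whole partition. When ORDER BY is present without a frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which produces a running total. A small sketch contrasting the two on the `test` table from the question (the aliases are illustrative, and behaviour on a specific Hive release is worth verifying):

```sql
-- group_total: no ORDER BY, so the frame is the whole partition (35 for key = 1, 9 for key = 2).
-- running_total: ORDER BY without a frame, so the sum accumulates in ascending cost order.
SELECT key, product_code, cost,
       SUM(cost) OVER (PARTITION BY key)               AS group_total,
       SUM(cost) OVER (PARTITION BY key ORDER BY cost) AS running_total
FROM test;
```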
