Hive GROUP BY after a self join

Time: 2014-05-12 15:45:43

Tags: hadoop hive

Folks,

We have a requirement where we want to apply a GROUP BY clause after self-joining a Hive table.

Example data:

CUSTOMER_NAME,PRODUCT_NAME,PURCHASE_PRICE

customer1,product1,20
customer1,product2,30
customer1,product1,25

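Now we want to get the top customers across all products (the top 5 customers by summed price, so the product name is not part of the subquery) and then group the result set by CUSTOMER_NAME, PRODUCT_NAME:
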
select customer_name,product_name,sum(purchase_price)
from customer_prd cprd
Join (select customer_name,sum(purchase_prices) order by sum group by customer_name limit 5) cprdd
where cprd.customer_name = cprdd.customer_name group by customer_name,product_name

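We get an error in Hive saying it cannot group like this?

2 answers:

Answer 0: (score: 2)

After the join, your column names become ambiguous. Hive does not know whether you mean the left or the right side of the join. In this case it does not matter, because you are doing an inner join on those columns and they are equal, but Hive is not smart enough to figure that out. Try this: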

select cprd.customer_name, cprd.product_name, sum(purchase_price)
from customer_prd cprd
Join (select customer_name, sum(purchase_price) as sum from customer_prd group by customer_name order by sum desc limit 5) cprdd
where cprd.customer_name = cprdd.customer_name group by cprd.customer_name, cprd.product_name;
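
The ambiguity can also be reproduced outside Hive. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example is runnable; Hive's dialect and error text differ) with the sample rows from the question. The unqualified query is rejected, while the qualified one returns the per-product totals.

```python
import sqlite3

# Assumption: SQLite stands in for Hive purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_prd "
    "(customer_name TEXT, product_name TEXT, purchase_price INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_prd VALUES (?, ?, ?)",
    [
        ("customer1", "product1", 20),
        ("customer1", "product2", 30),
        ("customer1", "product1", 25),
    ],
)

# Subquery: top 5 customers by total spend.
TOP_CUSTOMERS = """
    SELECT customer_name, SUM(purchase_price) AS total
    FROM customer_prd
    GROUP BY customer_name
    ORDER BY total DESC
    LIMIT 5
"""

# Both sides of the join expose customer_name, so the bare name is ambiguous.
unqualified = f"""
    SELECT customer_name, product_name, SUM(purchase_price)
    FROM customer_prd cprd
    JOIN ({TOP_CUSTOMERS}) cprdd ON cprd.customer_name = cprdd.customer_name
    GROUP BY customer_name, product_name
"""

# Qualifying every column with the cprd alias removes the ambiguity.
qualified = f"""
    SELECT cprd.customer_name, cprd.product_name, SUM(cprd.purchase_price)
    FROM customer_prd cprd
    JOIN ({TOP_CUSTOMERS}) cprdd ON cprd.customer_name = cprdd.customer_name
    GROUP BY cprd.customer_name, cprd.product_name
    ORDER BY cprd.product_name
"""

try:
    conn.execute(unqualified).fetchall()
    unqualified_error = None
except sqlite3.OperationalError as exc:
    unqualified_error = str(exc)

rows = conn.execute(qualified).fetchall()
print("unqualified query error:", unqualified_error)
print("qualified query rows:", rows)
```

The sketch uses an explicit ON clause rather than JOIN plus WHERE; for an inner join the two forms select the same rows.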

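Answer 1: (score: 2)

I think Joe K is correct, but I would reconsider what you are doing and avoid the join altogether by using the 'collect' or 'collect_max' UDFs available in the Brickhouse library (http://github.com/klout/brickhouse). First sum by product, then collect and sum at the same time.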

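-- collect() comes from the Brickhouse library, not from core Hive.
SELECT customer_name, sum(purchases) AS total_purchases, collect(product_name, purchases) AS product_map
FROM
  ( -- First sum per customer and product.
    SELECT customer_name, product_name, sum(purchase_price) AS purchases
    FROM customer_prd
    GROUP BY customer_name, product_name
  ) sp
GROUP BY customer_name
-- Sort descending so that LIMIT keeps the top 5 customers.
ORDER BY total_purchases DESC
LIMIT 5;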

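This will still cause a sort in order to get the top 5 customers. If you have a large number of small customers but a few big customer "whales", you could add a 'HAVING sum(purchases) > ...' clause to reduce the number of records that have to be sorted.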
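
Filtering on the aggregate before sorting can be sketched as follows, again with Python's sqlite3 module as a runnable stand-in for Hive. The extra customers (whale1, whale2, small1) and the threshold of 100 are hypothetical values chosen only for illustration, and the Brickhouse collect() call is left out because SQLite has no equivalent.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; rows other than customer1 are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_prd "
    "(customer_name TEXT, product_name TEXT, purchase_price INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_prd VALUES (?, ?, ?)",
    [
        ("customer1", "product1", 20),
        ("customer1", "product2", 30),
        ("customer1", "product1", 25),
        ("whale1", "product1", 500),
        ("whale1", "product3", 700),
        ("whale2", "product2", 900),
        ("small1", "product1", 5),
    ],
)

MIN_TOTAL = 100  # hypothetical cut-off; it must sit below the 5th-largest total

query = """
    SELECT customer_name, SUM(purchases) AS total_purchases
    FROM (
        -- First sum per customer and product.
        SELECT customer_name, product_name, SUM(purchase_price) AS purchases
        FROM customer_prd
        GROUP BY customer_name, product_name
    ) sp
    GROUP BY customer_name
    -- Drop small customers before the sort has to order them.
    HAVING SUM(purchases) > ?
    ORDER BY total_purchases DESC
    LIMIT 5
"""

top = conn.execute(query, (MIN_TOTAL,)).fetchall()
print(top)
```

If the threshold is set too high, fewer than 5 customers survive the filter, so it only helps when a safe lower bound for the top totals is known in advance.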
