Hive - column-level subquery workaround

Time: 2018-04-02 04:40:07

Tags: hive hiveql

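I ran into a problem with column-level subqueries. Say I want a result that shows, for each store, today's transaction next to the same store's transaction from yesterday, last week, and last month, built by self-joining a table that contains (date, store, transaction).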

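I know this can be done in a traditional data warehouse (using column-level subqueries), but I found that Hive lacks this feature, so I wrote my own query: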

select main_table.date,main_table.store,main_table.transaction,yest_table.transaction as yesterday_trans, lw_table.transaction as lastweek_trans, lm_table.transaction as lastmonth_trans
    from
    (select date, store, transaction from table where date=current_date)main_table
    left join
    (select date, store, transaction from table where date=date_sub(current_date,1))yest_table
    on date_sub(main_table.date,1)=yest_table.date and main_table.store=yest_table.store
    left join
    (select date, store, transaction from table where date=date_sub(current_date,7))lw_table
    on date_sub(main_table.date,7)=lw_table.date and main_table.store=lw_table.store
    left join
    (select date, store, transaction from table where date=add_months(current_date,-1))lm_table
    on add_months(current_date,-1)=lm_table.date and main_table.store=lm_table.store

Is this correct? I suspect there may be a better solution.

Thanks.

1 answer:

Answer 0: (score: 1)

Use case + max() aggregation:

select main.date,main.store,main.transaction,s.yesterday_trans,s.lastweek_trans,s.lastmonth_trans
    from
    (select date, store, transaction from table where date=current_date)main
    left join
    (select store, 
       max(case when date = date_sub(current_date,1)    then transaction end) yesterday_trans,  
       max(case when date = date_sub(current_date,7)    then transaction end) lastweek_trans,
       max(case when date = add_months(current_date,-1) then transaction end) lastmonth_trans
       from table 
      where date>=add_months(current_date,-1) and date<=date_sub(current_date,1)
      group by store
    ) s on main.store=s.store;

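This way you eliminate two unnecessary table scans and joins. This solution works only for current_date (or a fixed parameter in place of current_date). If you want to select many dates from the main table, then your solution with three joins on date + store will work best.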
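For the many-dates case, the three-join approach could look roughly like the sketch below. It is only an illustration under assumptions: `sales` is a hypothetical name standing in for the table in the question, the 30-day window is arbitrary, and the table is assumed to hold one row per (date, store). The column names are backquoted as a precaution, since names like `date` can collide with Hive keywords.

```sql
-- Sketch: compare every day in a range with earlier days for the same store.
-- `sales` is a hypothetical table name; assumes one row per (date, store).
select m.`date`, m.store, m.`transaction`,
       y.`transaction`  as yesterday_trans,
       lw.`transaction` as lastweek_trans,
       lm.`transaction` as lastmonth_trans
  from sales m
  left join sales y  on y.store  = m.store and y.`date`  = date_sub(m.`date`, 1)
  left join sales lw on lw.store = m.store and lw.`date` = date_sub(m.`date`, 7)
  left join sales lm on lm.store = m.store and lm.`date` = add_months(m.`date`, -1)
 where m.`date` >= date_sub(current_date, 30);  -- hypothetical reporting window
```

Each join key here is computed from the main row's own date rather than from current_date, which is what lets the same query serve any number of dates.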

Hmm, most likely LAG is also an applicable solution...

select date,store,transaction,
    case when lag(date,1) over(partition by store order by date) = date_sub(date,1) --check if LAG(1) is yesterday (previous date)
         then lag(transaction,1) over(partition by store order by date)
    end as yesterday_trans 
...
--where date>=add_months(current_date,-1) and date<=date_sub(current_date,1)

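Add aggregation if necessary. If the LAG solution is applicable, it will be the fastest, because no joins are needed at all and everything is done in a single scan. If there are many records per date, you can probably pre-aggregate them before applying LAG. This approach is not limited to current_date.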
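The pre-aggregation idea might be sketched as follows. This is an assumption-laden illustration, not the answerer's exact query: `sales` is a hypothetical table name, and sum() is only one possible way to collapse several records per date into one. Column names are backquoted as a precaution against keyword collisions.

```sql
-- Sketch: collapse to one row per (store, date) first, then apply LAG.
-- `sales` is a hypothetical table name; sum() is an assumed aggregation.
select `date`, store, `transaction`,
       case when lag(`date`, 1) over (partition by store order by `date`) = date_sub(`date`, 1)
            then lag(`transaction`, 1) over (partition by store order by `date`)
       end as yesterday_trans
  from (select `date`, store, sum(`transaction`) as `transaction`
          from sales
         group by `date`, store) daily;
```

Note that a fixed offset such as lag(..., 7) lines up with "seven days ago" only when a store has no missing dates. With gaps, the date check in the CASE returns NULL even if a row for that earlier date exists, so the join-based variants are safer for sparse data.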