Question

I have a question about Hive. Let me explain the scenario:

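I am using a Hive action in Oozie; I have a query that performs successive LEFT JOINs on different tables;
The total number of rows to insert is about 35 million;
At first the job crashed because it ran out of memory, so I set "set hive.auto.convert.join = false"; the query then ran perfectly but took 4 hours to complete;
I tried reordering the LEFT JOINs so that the bigger tables come last, but the result was the same: about 4 hours to execute.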

Here is what the query looks like:

INSERT OVERWRITE TABLE final_table
SELECT 
T1.Id,
T1.some_field_name,
T1.another_filed_name,

T2.also_another_filed_name,

FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id ) -- T2 is the smallest table
LEFT JOIN table3 T3 ON ( T3.Id = T1.Id )
LEFT JOIN table4 T4 ON ( T4.Id = T1.Id ) -- T4 is the biggest table

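So, given the structure of the query, is there a way to rewrite it so that I can avoid so many JOINs?

Thanks in advance

PS: Even vectorization gave me the same execution time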

Answer 1

Too long for a comment; will be deleted later.

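(1) Your current query does not compile. (2) You are not selecting anything from T3 and T4, which makes no sense.
(3) Changing the order of the tables will have no effect with the cost-based optimizer. (4) Basically I would suggest collecting statistics on the tables, specifically on the id columns, but in your case I have a feeling that id is not unique in more than one table.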
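For point (4), collecting statistics could look roughly like the sketch below. It assumes the four tables from the question, a join key column named id, and a Hive version that supports column statistics and the cost-based optimizer; the two SET lines are assumptions about the session configuration, not something taken from the original post.

```sql
-- Table-level statistics (row counts, sizes) for each table in the join.
ANALYZE TABLE table1 COMPUTE STATISTICS;
ANALYZE TABLE table2 COMPUTE STATISTICS;
ANALYZE TABLE table3 COMPUTE STATISTICS;
ANALYZE TABLE table4 COMPUTE STATISTICS;

-- Column-level statistics on the join key, which the optimizer
-- can use when estimating join cardinality.
ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table2 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table3 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table4 COMPUTE STATISTICS FOR COLUMNS id;

-- Assumed session settings: enable the cost-based optimizer
-- and let it read the column statistics gathered above.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;
```

Statistics only help the optimizer plan the joins; they will not reduce the work if the joins themselves multiply rows because id is duplicated.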

Add the results of the following query to your post:

select      *
           ,    case when cnt_1 = 0 then 1 else cnt_1 end
            *   case when cnt_2 = 0 then 1 else cnt_2 end
            *   case when cnt_3 = 0 then 1 else cnt_3 end
            *   case when cnt_4 = 0 then 1 else cnt_4 end   as product


from       (select      id
                       ,count(case when tab = 1 then 1 end) as cnt_1
                       ,count(case when tab = 2 then 1 end) as cnt_2
                       ,count(case when tab = 3 then 1 end) as cnt_3
                       ,count(case when tab = 4 then 1 end) as cnt_4

            from       (            select 1 as tab,id from table1
                        union all   select 2 as tab,id from table2  
                        union all   select 3 as tab,id from table3
                        union all   select 4 as tab,id from table4 
                        ) t

            group by    id

            having      greatest (cnt_1,cnt_2,cnt_3,cnt_4) >= 10
            ) t 

order by    product desc

limit       10
;
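The product column estimates how many output rows a single id produces once all four tables are joined: an id that appears 1, 3, 20 and 50 times in the four tables yields 1 * 3 * 20 * 50 = 3000 rows. As a cheaper first look, a per-table comparison of total rows against distinct ids might be sketched as follows. This is only an illustrative check, assuming the same table and column names as above; the aliases row_cnt and distinct_ids are made up for the example.

```sql
-- If row_cnt is larger than distinct_ids for a table,
-- id is not unique in that table and joins on it can multiply rows.
SELECT *
FROM (
              SELECT 'table1' AS tab, COUNT(*) AS row_cnt, COUNT(DISTINCT id) AS distinct_ids FROM table1
    UNION ALL SELECT 'table2' AS tab, COUNT(*) AS row_cnt, COUNT(DISTINCT id) AS distinct_ids FROM table2
    UNION ALL SELECT 'table3' AS tab, COUNT(*) AS row_cnt, COUNT(DISTINCT id) AS distinct_ids FROM table3
    UNION ALL SELECT 'table4' AS tab, COUNT(*) AS row_cnt, COUNT(DISTINCT id) AS distinct_ids FROM table4
) u
;
```

If more than one table shows row_cnt well above distinct_ids, the long runtime is probably caused by the volume of rows the joins generate rather than by the join order.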
