Range join reduces the number of records

Time: 2017-06-07 04:26:29

Tags: join hive left-join between impala

Following up on my question, I have the following tables. The first (Range) contains value ranges and additional columns:

row  | From   |  To     | Country ....
-----|--------|---------|---------
1    | 1200   |   1500  |
2    | 2200   |   2700  |
3    | 1700   |   1900  |
4    | 2100   |   2150  |
... 

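From and To are bigint, and the ranges are exclusive. The Range table contains 1.8M records. The additional table (Values) contains 2.7M records and looks like this: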

 row     | Value  | More columns....
 --------|--------|----------------
    1    | 1777   |    
    2    | 2122   |    
    3    | 1832   |    
    4    | 1340   |    
    ... 

I want to create a table like the following:

 row     | Value  | From   | To    | More columns....
 --------|--------|--------|-------|-----------------
    1    | 1777   | 1700   | 1900  |
    2    | 2122   | 2100   | 2150  |   
    3    | 1832   | 1700   | 1900  |   
    4    | 1340   | 1200   | 1500  |   
    ... 

I used a left outer join in the following code:

set n=1000;

select      v.id
           ,v.val
           ,r.from_val
           ,r.to_val

from      val v
        left outer join    

 (select  r.*
                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

            from    val_range r
                    lateral view    posexplode
                                    (
                                        split
                                        (
                                            space
                                            (
                                                cast
                                                (
                                                    floor(to_val/${hiveconf:n}) 
                                                  - floor(from_val/${hiveconf:n}) 

                                                    as int
                                                )
                                            )
                                           ,' '
                                        )
                                    ) pe as i,x
            ) r



            on      floor(v.val/${hiveconf:n})    =
                    r.match_val

where       v.val between r.from_val and r.to_val

order by    v.id       
;

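However, the new table has far fewer records than the ~2.7M records in the Values table. How can that happen when I am using a left outer join? How can I fix it?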

1 answer:

Answer 0: (score: 1)

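Assuming we have v.id: the where clause on r.from_val and r.to_val also discards the rows that the left outer join filled with NULLs, so the query behaves like an inner join. Do the range join as an inner join in a subquery, then left join its result back to val on id: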

set n=1000;

select      v.id
           ,r.from_val
           ,r.to_val

from                    val     v 

            left join  (select      v.id
                                   ,r.from_val
                                   ,r.to_val

                        from                val     v 

                                    join    (...)   r 

                                    on      floor(v.val/${hiveconf:n})    =
                                            r.match_val

                        where       v.val between r.from_val and r.to_val
                        ) r

            on          r.id    =
                        v.id

order by    v.id       
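
To see why moving the range join into a subquery brings the missing rows back, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than on Hive, simplifies the bucket condition to integer division of from_val (the sample ranges never cross a bucket boundary), and uses made-up rows: ids 3 and 5 are hypothetical values that fall in no range.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Hive; the data is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table val (id integer, val integer);
    create table val_range (from_val integer, to_val integer);
    insert into val values (1, 1777), (2, 2122), (3, 1600), (4, 1340), (5, 3500);
    insert into val_range values (1200, 1500), (1700, 1900), (2100, 2150);
""")

# Shape of the question's query: left join, then a WHERE on the right-hand side.
# Unmatched rows carry NULLs in r.*, so BETWEEN evaluates to NULL and drops them.
filtered_in_where = conn.execute("""
    select v.id, r.from_val, r.to_val
    from val v
    left join val_range r on v.val / 1000 = r.from_val / 1000
    where v.val between r.from_val and r.to_val
    order by v.id
""").fetchall()

# Shape of the answer's query: inner range join in a subquery, then left join on id.
fixed = conn.execute("""
    select v.id, r.from_val, r.to_val
    from val v
    left join (
        select v.id as id, r.from_val as from_val, r.to_val as to_val
        from val v
        join val_range r on v.val / 1000 = r.from_val / 1000
        where v.val between r.from_val and r.to_val
    ) r on r.id = v.id
    order by v.id
""").fetchall()

print(len(filtered_in_where), "rows with the filter in WHERE")
print(len(fixed), "rows with the subquery version")
```

In this sketch the first query returns only ids 1, 2 and 4, while the second returns all five ids, with NULL range columns for ids 3 and 5.
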

At the OP's request, here is the full query:

set n=1000;

select      v.id
           ,r.from_val
           ,r.to_val

from                    val     v 

            left join  (select      v.id
                                   ,r.from_val
                                   ,r.to_val

                        from                val     v 

                                    join   (select  r.*
                                                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

                                            from    val_range r
                                                    lateral view    posexplode
                                                                    (
                                                                        split
                                                                        (
                                                                            space
                                                                            (
                                                                                cast
                                                                                (
                                                                                    floor(to_val/${hiveconf:n}) 
                                                                                  - floor(from_val/${hiveconf:n}) 

                                                                                    as int
                                                                                )
                                                                            )
                                                                           ,' '
                                                                        )
                                                                    ) pe as i,x
                                            ) r

                                    on      floor(v.val/${hiveconf:n})    =
                                            r.match_val

                        where       v.val between r.from_val and r.to_val
                        ) r

            on          r.id    =
                        v.id

order by    v.id       
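
The bucketing in the query above can be hard to read through the nested space/split/posexplode calls. The following is a rough Python sketch of what it appears to compute, not the Hive implementation: each range is copied once per bucket of width n that it touches, and a value is compared only against ranges in its own bucket. The names bucket_keys and range_left_join are hypothetical, and the range (2900, 3100) and the values 3050 and 1600 are made-up additions to the question's sample data.

```python
from collections import defaultdict

N = 1000  # bucket width, playing the role of ${hiveconf:n}


def bucket_keys(from_val, to_val, n=N):
    """Bucket ids a range touches: floor(from/n) + i for i = 0 .. floor(to/n) - floor(from/n)."""
    first = from_val // n
    last = to_val // n
    return [first + i for i in range(last - first + 1)]


def range_left_join(values, ranges, n=N):
    """Left join values to ranges via bucket keys; unmatched values keep None columns."""
    # Index every range under each bucket it touches (the match_val column).
    index = defaultdict(list)
    for from_val, to_val in ranges:
        for key in bucket_keys(from_val, to_val, n):
            index[key].append((from_val, to_val))

    result = []
    for row_id, val in values:
        # Equi-join on the bucket, then the BETWEEN check inside that bucket.
        matches = [(f, t) for f, t in index.get(val // n, []) if f <= val <= t]
        if matches:
            result.extend((row_id, val, f, t) for f, t in matches)
        else:
            result.append((row_id, val, None, None))
    return result


ranges = [(1200, 1500), (2200, 2700), (1700, 1900), (2100, 2150), (2900, 3100)]
values = [(1, 1777), (2, 2122), (3, 1832), (4, 1340), (5, 3050), (6, 1600)]

for row in range_left_join(values, ranges):
    print(row)
```

A smaller n means fewer candidate ranges per bucket but more copies of each wide range, so the choice of n=1000 is a trade-off that depends on how wide the ranges are.
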