Alternative / more efficient approach than a self-join in a Hive query

Time: 2014-08-12 15:23:23

Tags: database hadoop hive self-join

I have a Hadoop table, described below:

mappers
id (int)
mapper (String)
mapperid (int)
date (int)

Some sample rows look like this:

1, MAP1, 123, 20140810
1, MAP2, 3421, 20140810
2, MAP1, 34211, 20140810
2, MAP3, 1143, 20140810
3, MAP4, 12, 20140810

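I am trying to condense these results into one row per unique id, with the mapperid values associated with it. Based on the sample data above, I would like my query to return: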

1, 123, 3421, null
2, 34211, null, 1143

Below is my Hive query. It essentially grabs all the data I want and combines it where applicable. Unfortunately, this requires 4 table lookups.

SELECT DISTINCT
    full.id,
    mapper01.mapperid,
    mapper02.mapperid,
    mapper03.mapperid
FROM mappers AS full
LEFT JOIN (
    SELECT id, mapperid FROM mappers
    WHERE mapper = "MAP1" AND
    date = 20140810 AND
    length(id) > 0
) AS mapper01 ON mapper01.id = full.id
LEFT JOIN (
    SELECT id, mapperid FROM mappers
    WHERE mapper = "MAP2" AND
    date = 20140810 AND
    length(id) > 0
) AS mapper02 ON mapper02.id = full.id
LEFT JOIN (
    SELECT id, mapperid FROM mappers
    WHERE mapper = "MAP3" AND
    date = 20140810 AND
    length(id) > 0
) AS mapper03 ON mapper03.id = full.id
-- Qualify the outer filter columns: id also exists in each joined subquery.
WHERE full.date = 20140810 AND
    length(full.id) > 0 AND
    (full.mapper = "MAP1" OR
     full.mapper = "MAP2" OR
     full.mapper = "MAP3"
    );

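I am considering using FULL OUTER JOINs instead of LEFT JOINs, so that I only need 3 table lookups (the outermost one, which grabs all the data, is redundant), with some IF logic to take the value for full.id from whichever table has it.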
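As a rough sketch of that idea (not a tested query), the three per-mapper subqueries could be joined with FULL OUTER JOIN, using COALESCE in place of hand-written IF logic to pick the id from whichever side has it. The aliases m1, m2, m3, m12 and the output column names are illustrative assumptions, and `date` is backquoted because it is a reserved word in newer Hive versions.

```sql
-- Sketch only: three scans of mappers instead of four.
-- The first two subqueries are joined in a nested query so that the
-- second join can use the already-coalesced id.
SELECT COALESCE(m12.id, m3.id) AS id,
       m12.map1_id,
       m12.map2_id,
       m3.mapperid             AS map3_id
FROM (
    SELECT COALESCE(m1.id, m2.id) AS id,
           m1.mapperid            AS map1_id,
           m2.mapperid            AS map2_id
    FROM (
        SELECT id, mapperid FROM mappers
        WHERE mapper = 'MAP1' AND `date` = 20140810 AND length(id) > 0
    ) m1
    FULL OUTER JOIN (
        SELECT id, mapperid FROM mappers
        WHERE mapper = 'MAP2' AND `date` = 20140810 AND length(id) > 0
    ) m2 ON m1.id = m2.id
) m12
FULL OUTER JOIN (
    SELECT id, mapperid FROM mappers
    WHERE mapper = 'MAP3' AND `date` = 20140810 AND length(id) > 0
) m3 ON m12.id = m3.id;
```

This still scans the table once per mapper value, so the number of lookups grows with the number of mappers.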

But I am wondering whether there is a better approach than using joins.

1 answer:

Answer 0: (score: 0)

I wrote a blog post (http://brickhouseconfessions.wordpress.com/2013/03/05/use-collect-to-avoid-the-self-join/) on how to solve this problem with the 'collect' UDF from Brickhouse (http://github.com/klout/brickhouse).

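Instead of doing a self-join, where Hadoop ends up making multiple scans over the data, use collect to group the data into a single map. Then use the 'map_index' UDF to access the elements you want from that map. Your query would be something like this: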

SELECT id,
       MAP_INDEX( id_map, 'MAP1' ) AS MAP1_ID,
       MAP_INDEX( id_map, 'MAP2' ) AS MAP2_ID,
       MAP_INDEX( id_map, 'MAP3' ) AS MAP3_ID
FROM (
    SELECT id, COLLECT( mapper, mapperid ) AS id_map
    FROM mappers
    WHERE date = 20140810
      AND length(id) > 0
    GROUP BY id
) m;
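
For comparison, a similar single-scan pivot can be sketched with Hive's built-in aggregates only, using conditional aggregation (MAX over CASE). This is a different technique from the Brickhouse collect approach above, not part of it. The sketch assumes at most one mapperid per (id, mapper) pair for the given date; if there are duplicates, MAX simply keeps the largest value. The output column names are illustrative, the mapper IN (...) filter is added to mirror the filter in the question (so ids that only have other mappers, such as MAP4, are left out), and `date` is backquoted because it is a reserved word in newer Hive versions.

```sql
-- Sketch only: one scan of mappers, one output row per id.
-- CASE without ELSE yields NULL, and MAX ignores NULLs, so each column
-- ends up with the mapperid for its mapper, or NULL if the id has none.
SELECT id,
       MAX(CASE WHEN mapper = 'MAP1' THEN mapperid END) AS map1_id,
       MAX(CASE WHEN mapper = 'MAP2' THEN mapperid END) AS map2_id,
       MAX(CASE WHEN mapper = 'MAP3' THEN mapperid END) AS map3_id
FROM mappers
WHERE `date` = 20140810
  AND length(id) > 0
  AND mapper IN ('MAP1', 'MAP2', 'MAP3')
GROUP BY id;
```

On the sample rows in the question, this form would be expected to give the two rows the asker wants, since id 3 is dropped by the mapper filter before grouping.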