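Hive table with multiple partitions

Time: 2017-08-30 08:50:29

Tags: hive hiveql

I have a table (data_table) with multiple partition columns: year/month/monthkey.

The directories look like year=2017/month=08/monthkey=2017-08/files.parquet

Which of the following queries would be faster?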

select count(*) from data_table where monthkey='2017-08'

select count(*) from data_table where monthkey='2017-08' and year = '2017' and month = '08'

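I think that in the first case Hadoop would need more time up front to find the required directories, but I would like to confirm.

2 answers:

Answer 0: (score: 3)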

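Finding the relevant partitions is a metastore operation, not a file system operation. It is done by querying the metadata, not by scanning the directories. The metadata query for the first use case will most likely be faster than the one for the second use case, but either way we are talking about fractions of a second.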

Demo

create external table t100k(i int)
partitioned by (x int,y int,xy string)
;

explain dependency select count(*) from t100k where xy='100-1000';

Query issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"  
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"     and "TBLS"."TBL_NAME" = 't100k'   
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"      and "DBS"."NAME" = 'local_db' 
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2 
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))

explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';

Query issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"  
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"     and "TBLS"."TBL_NAME" = 't100k'   
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"      and "DBS"."NAME" = 'local_db' 
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0 
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1 
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2 
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100) 
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))  
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )
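
To make the comparison concrete, here is a minimal sketch that mimics the two lookups above against an in-memory SQLite database. It is not the real metastore: the two tables are reduced to a few columns, the partition values are made up for illustration, and the `__HIVE_DEFAULT_PARTITION__` handling is left out. It only shows that filtering on xy alone needs one join to PARTITION_KEY_VALS, filtering on all three keys needs three joins plus casts, and both resolve to the same partition.

```python
import sqlite3

# Toy stand-in for two metastore tables; the real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table PARTITIONS (PART_ID integer primary key, PART_NAME text);
    create table PARTITION_KEY_VALS (PART_ID integer, INTEGER_IDX integer, PART_KEY_VAL text);
""")

# Hypothetical partitions of t100k (x, y, xy), where xy is derived from x and y.
specs = [(100, 1000), (100, 2000), (200, 1000)]
for part_id, (x, y) in enumerate(specs, start=1):
    xy = f"{x}-{y}"
    conn.execute("insert into PARTITIONS values (?, ?)",
                 (part_id, f"x={x}/y={y}/xy={xy}"))
    conn.executemany(
        "insert into PARTITION_KEY_VALS values (?, ?, ?)",
        [(part_id, 0, str(x)), (part_id, 1, str(y)), (part_id, 2, xy)],
    )

# Predicate on xy only: a single join to PARTITION_KEY_VALS.
one_key = conn.execute("""
    select p.PART_ID from PARTITIONS p
    join PARTITION_KEY_VALS f2 on f2.PART_ID = p.PART_ID and f2.INTEGER_IDX = 2
    where f2.PART_KEY_VAL = '100-1000'
""").fetchall()

# Predicate on x, y and xy: three joins, with casts for the int keys.
three_keys = conn.execute("""
    select p.PART_ID from PARTITIONS p
    join PARTITION_KEY_VALS f0 on f0.PART_ID = p.PART_ID and f0.INTEGER_IDX = 0
    join PARTITION_KEY_VALS f1 on f1.PART_ID = p.PART_ID and f1.INTEGER_IDX = 1
    join PARTITION_KEY_VALS f2 on f2.PART_ID = p.PART_ID and f2.INTEGER_IDX = 2
    where cast(f0.PART_KEY_VAL as integer) = 100
      and cast(f1.PART_KEY_VAL as integer) = 1000
      and f2.PART_KEY_VAL = '100-1000'
""").fetchall()

# Both predicates select the same partition id.
print(one_key, three_keys)
```

The extra joins and casts are the only additional work in the second form, which is why the difference stays in the metadata lookup and does not affect which files are read.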

Answer 1: (score: 0)

Posting this here because a comment would break the formatting. Please accept @Dudu's answer. Run the following command on the metastore DB (MySQL in my case):

mysql> select part_id, location, tbl_id, part_name from PARTITIONS as P inner join SDS as S on P.SD_ID = S.SD_ID where P.TBL_ID = 472;
+---------+-----------------------------------------------------+--------+--------------------------------------+
| part_id | location                                            | tbl_id | part_name                            |
+---------+-----------------------------------------------------+--------+--------------------------------------+
|       7 | hdfs://hostname:8020/tmp/multi_part/2011/01/2011-01 |    472 | year=2011/month=1/year_month=2011-01 |
|       9 | hdfs://hostname:8020/tmp/multi_part/2012/01/2012-01 |    472 | year=2012/month=1/year_month=2012-01 |
+---------+-----------------------------------------------------+--------+--------------------------------------+
2 rows in set (0.00 sec)

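Both of your queries will fetch data from the same HDFS directory. The only difference in speed comes from the metastore DB query, as already explained in Dudu's answer.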
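
As a rough illustration of that point, the sketch below takes the two rows from the MySQL output above and applies both styles of predicate to them in plain Python. The helper names `parse_part_name` and `locations` are made up for this example and are not part of Hive; the code only assumes that part_name has the `key=value/key=value` form shown in the output.

```python
# Rows copied from the metastore output above: (part_id, location, part_name).
rows = [
    (7, "hdfs://hostname:8020/tmp/multi_part/2011/01/2011-01",
     "year=2011/month=1/year_month=2011-01"),
    (9, "hdfs://hostname:8020/tmp/multi_part/2012/01/2012-01",
     "year=2012/month=1/year_month=2012-01"),
]

def parse_part_name(part_name):
    """Split 'k1=v1/k2=v2' into a dict of partition key -> value."""
    return dict(item.split("=", 1) for item in part_name.split("/"))

def locations(predicate):
    """Return locations of partitions matching every key/value in predicate."""
    return [
        loc for _, loc, name in rows
        if all(parse_part_name(name).get(k) == v for k, v in predicate.items())
    ]

# Filtering on the last key alone or on all three keys yields the same location.
short = locations({"year_month": "2011-01"})
full = locations({"year": "2011", "month": "1", "year_month": "2011-01"})
print(short)
print(short == full)
```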