Why doesn't a Hive index improve query speed?

Asked: 2013-09-09 19:58:02

Tags: sql hadoop indexing hive

I have an external Hive table whose structure is basically:

CREATE EXTERNAL TABLE foo (time double, name string, value double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://node/foodir';

I created an index on (name, value):

CREATE INDEX idx ON TABLE foo(name, value)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX idx ON foo REBUILD;
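One thing worth checking: in Hive versions of this era, building a compact index is not enough by itself — the optimizer only rewrites queries to use indexes when index-based filtering is switched on. A sketch of the session settings that are typically required (property names are from Hive's configuration and should be verified against your Hive version; the size threshold value here is illustrative):

```sql
-- Allow the optimizer to use indexes when evaluating filter predicates;
-- this is off by default.
SET hive.optimize.index.filter=true;
-- Minimum input size (bytes) before the compact index is considered;
-- lowering it to 0 forces the index to be eligible for any input.
SET hive.optimize.index.filter.compact.minsize=0;
```

With these unset, EXPLAIN will show a plain TableScan over the base table even when a valid index exists.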

My query is:

SELECT minute, count(minute) AS mincount
FROM (SELECT round(time/60) AS minute FROM foo WHERE name = 'Foo'
AND value > 100000) t2 GROUP BY minute ORDER BY mincount DESC LIMIT 1;

However, even though the rows satisfying the condition (name = 'Foo' AND value > 100000) probably make up only about 0.1% of all rows, the query still scans the entire dataset and runs about as fast as it does on a table without the index.

Is there something wrong with my indexing scheme or with the query?
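As a diagnostic step, the compact index handler materializes its data into a separate table, conventionally named `default__<table>_<index>__`, holding the indexed columns plus `_bucketname` and `_offsets`. Querying that table directly confirms whether the REBUILD actually populated the index (the table name below assumes the table lives in the `default` database; verify it with SHOW TABLES):

```sql
-- Assumed default naming for the compact index's backing table.
SHOW TABLES 'default__foo_idx__';
-- Inspect a few entries to confirm the index was built;
-- an empty result means the REBUILD did not populate it.
SELECT * FROM default__foo_idx__ LIMIT 10;
```

If the index table is populated but the plan still shows a full TableScan, the problem is in the optimizer settings rather than in the index itself.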

Output of running EXPLAIN SELECT ...:
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME log))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTION round (/ (TOK_TABLE_OR_COL time) 60)) hour)) (TOK_WHERE (> (TOK_TABLE_OR_COL value) 1000000)))) t2)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL hour)) (TOK_SELEXPR (TOK_FUNCTION count (TOK_TABLE_OR_COL hour)) hrcount)) (TOK_GROUPBY (TOK_TABLE_OR_COL hour)) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (TOK_TABLE_OR_COL hrcount))) (TOK_LIMIT 3)))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t2:log
          TableScan
            alias: log
            Filter Operator
              predicate:
                  expr: (value > 1000000.0)
                  type: boolean
              Select Operator
                expressions:
                      expr: round((time / 60))
                      type: double
                outputColumnNames: _col0
                Group By Operator
                  aggregations:
                        expr: count(_col0)
                  bucketGroup: false
                  keys:
                        expr: _col0
                        type: double
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: double
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: double
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: double
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: double
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://rn14n21/tmp/hive-lei/hive_2013-09-12_21-19-33_247_861290513429832428/-mr-10002
            Reduce Output Operator
              key expressions:
                    expr: _col1
                    type: bigint
              sort order: -
              tag: -1
              value expressions:
                    expr: _col0
                    type: double
                    expr: _col1
                    type: bigint
      Reduce Operator Tree:
        Extract
          Limit
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: 3


Time taken: 11.284 seconds, Fetched: 99 row(s)

0 Answers:

There are no answers.