Hive: COUNT(DISTINCT column) vs. SELECT COUNT(*) FROM (SELECT DISTINCT column)

Time: 2014-08-27 00:06:35

Tags: performance hive hiveql

There has been discussion claiming that Query 2 is faster than Query 1.

Query 1

    SELECT COUNT(DISTINCT A) FROM TAB_X;

Query 2

    SELECT COUNT(*) FROM (SELECT DISTINCT A FROM TAB_X)

I cannot understand why this would be the case.

Here is my understanding of how these queries are translated into MapReduce jobs.

Query 1
- Only one stage
- The mappers emit column A as the key and 1 as the value. **Is this correct? How is the distinct achieved?**
- There would be only one reducer, which just has to increment a counter for every key and list of values it receives. However, I am not sure how that single reducer knows when to emit the final count (**how does it know when to emit?**).
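
To make the single-stage picture above concrete, here is a minimal Python sketch of that model. It simulates the plan as described in the question and is not Hive's actual operator code; the names `map_phase` and `single_reducer` and the sample rows are made up for illustration. It assumes the shuffle hands the lone reducer its pairs sorted by key, so every distinct value of A arrives as one group, and that the total can only be emitted once the input is exhausted.

```python
from itertools import groupby


def map_phase(rows):
    """Hypothetical mapper: emit (A, 1) for every input row."""
    for a in rows:
        yield (a, 1)


def single_reducer(pairs):
    """Hypothetical lone reducer: count one per distinct key."""
    count = 0
    # sorted() stands in for the shuffle/sort step (an assumption of this
    # sketch); groupby then yields each key once with all of its values.
    for _key, _values in groupby(sorted(pairs), key=lambda kv: kv[0]):
        count += 1  # one increment per key, however many values it has
    # The loop ends only when the input is exhausted, so this is the
    # earliest point at which the final count can be emitted.
    return count


rows = ["u1", "u2", "u1", "u3", "u2"]  # made-up values of column A
distinct_count = single_reducer(map_phase(rows))
print(distinct_count)
```

In this model the whole key space passes through one reducer, which is the part that would not scale with the number of distinct values.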
  

Query 2
- Two stages

- Stage 1
     - The mappers emit column A as the key and 1 as the value.
     - There will be many reducers, each of which aggregates the values for each of its keys and emits the result for that key (which is column A).
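- Stage 2
     - The mappers read each user's record (the per-key output of Stage 1) and emit the same key for all users, with the value 1.
     - The reducer just sums those counts and emits the final result.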
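
The two-stage picture can be sketched the same way. Again this is a simulation of the model described in the question, not Hive's real execution plan; the function names, the `num_reducers` parameter, the constant key `"all"`, and the sample rows are all hypothetical. The point it illustrates is that Stage 1 can spread the deduplication over several reducers, leaving the single Stage 2 reducer with only one record per distinct value.

```python
from collections import defaultdict


def stage1_map(rows):
    """Hypothetical Stage 1 mapper: emit (A, 1) for every input row."""
    for a in rows:
        yield (a, 1)


def stage1_reduce(pairs, num_reducers=3):
    """Hypothetical Stage 1 reducers: each emits every key it owns once."""
    partitions = defaultdict(set)
    for key, _value in pairs:
        # Hash partitioning sends all copies of a key to the same reducer.
        partitions[hash(key) % num_reducers].add(key)
    for keys in partitions.values():
        yield from keys


def stage2_map(distinct_keys):
    """Hypothetical Stage 2 mapper: same key for every record, value 1."""
    for _key in distinct_keys:
        yield ("all", 1)


def stage2_reduce(pairs):
    """Hypothetical Stage 2 reducer: sum the 1s into the final count."""
    return sum(value for _key, value in pairs)


rows = ["u1", "u2", "u1", "u3", "u2"]  # made-up values of column A
distinct_keys = list(stage1_reduce(stage1_map(rows)))
total = stage2_reduce(stage2_map(distinct_keys))
print(total)
```

Changing `num_reducers` does not change the result, only how the Stage 1 work is divided.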
  

Please help me understand and answer my inline questions about Query 1, and confirm my understanding of Query 2.

0 answers:

No answers