Hive / SQL: counting occurrences across multiple columns and rows

Time: 2017-04-26 08:28:53

Tags: sql hadoop hive hiveql

I am looking for a smart way to count occurrences.

Here is an example:

 UserID     CityID    CountryID   TagID
 100000      1         30        5
 100001      1         30        6
 100000      2         20        7
 100000      2         40        8
 100001      1         40        6
 100002      1         40        5
 100002      1         20        6
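
To try the queries further down against this sample, the data could be loaded roughly as follows. This is only a sketch: the table name mytable is borrowed from the first answer, the INT column types are an assumption, and INSERT ... VALUES requires Hive 0.14 or later.

```sql
-- Assumption: all four columns are integers; the table name follows the first answer.
create table mytable (UserID int, CityID int, CountryID int, TagID int);

-- The seven sample rows from the question.
insert into table mytable values
    (100000, 1, 30, 5),
    (100001, 1, 30, 6),
    (100000, 2, 20, 7),
    (100000, 2, 40, 8),
    (100001, 1, 40, 6),
    (100002, 1, 40, 5),
    (100002, 1, 20, 6);
```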

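What I want to do:

I want to count the distinct values per column and per user. In the end I would like a table that shows, for each column, how many users have more than one distinct value.

The result should look like this, more or less: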

Different_CityIDs   Different_CountryIDs   Different_TagIDs
1                   3                      2

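Explanation:

  • Different_CityIDs: only UserID 100000 has more than one CityID
  • Different_CountryIDs: every user has more than one CountryID
  • Different_TagIDs: UserIDs 100000 and 100002 each have more than one TagID. User 100001 only has "6" as TagID.

I tried COUNTs over the columns combined with GROUP BYs, but in the end it did not work. Is there a smart solution?

Many thanks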

3 answers:

Answer 0: (score: 1)

select  count(case when pos=0 and count_distinct_ID>1 then 1 end) as different_cityid
       ,count(case when pos=1 and count_distinct_ID>1 then 1 end) as different_countryid
       ,count(case when pos=2 and count_distinct_ID>1 then 1 end) as different_tagid

from   (select      pe.pos
                   ,count (distinct pe.ID) as count_distinct_ID
        from        mytable t
                    lateral view posexplode (array(CityID,CountryID,TagID)) pe as pos,ID

        group by    t.UserID
                   ,pe.pos        
        ) t          
;
+------------------+---------------------+-----------------+
| different_cityid | different_countryid | different_tagid |
+------------------+---------------------+-----------------+
|                1 |                   3 |               2 |
+------------------+---------------------+-----------------+
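
To see what the outer query is aggregating, the inner query can be run on its own. The sketch below is the same subquery with UserID added to the select list (that addition is mine, not part of the answer); it assumes the table is called mytable as above.

```sql
-- Sketch: the inner query of the answer above, with UserID added to the output.
-- pos is the 0-based index into array(CityID,CountryID,TagID):
-- 0 = CityID, 1 = CountryID, 2 = TagID.
select      t.UserID
           ,pe.pos
           ,count (distinct pe.ID) as count_distinct_ID
from        mytable t
            lateral view posexplode (array(CityID,CountryID,TagID)) pe as pos,ID
group by    t.UserID
           ,pe.pos
;
```

On the sample data this gives one row per user and position: 2, 3, 3 for user 100000, 1, 2, 1 for user 100001 and 1, 2, 2 for user 100002 (positions 0, 1, 2 respectively; row order is not guaranteed). The outer query then counts, per position, the rows whose count is greater than 1.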

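Here is another variant that avoids count(distinct ...). Note that, despite its name, is_distinct_ID is true when a user's minimum and maximum values are equal (<=> is Hive's null-safe equality operator), that is, when the user has only one value. The outer query therefore counts the rows where it is false.
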
select  count (case when pos=0 and not is_distinct_ID then 1 end)  as different_cityid
       ,count (case when pos=1 and not is_distinct_ID then 1 end)  as different_countryid
       ,count (case when pos=2 and not is_distinct_ID then 1 end)  as different_tagid

from   (select      pe.pos
                   ,min(pe.ID)<=>max(pe.ID)  as is_distinct_ID
        from        mytable t
                    lateral view posexplode (array(CityID,CountryID,TagID)) pe as pos,ID

        group by    t.UserID
                   ,pe.pos        
        ) t          
; 

... and yet another variant:

select  count (case when not is_distinct_CityID    then 1 end)   as different_cityid
       ,count (case when not is_distinct_CountryID then 1 end)   as different_countryid
       ,count (case when not is_distinct_TagID     then 1 end)   as different_tagid

from   (select      min (CityID)    <=> max (CityID)     as is_distinct_CityID
                   ,min (CountryID) <=> max (CountryID)  as is_distinct_CountryID
                   ,min (TagID)     <=> max (TagID)      as is_distinct_TagID

        from        mytable

        group by    UserID     
        ) t          
;

Answer 1: (score: 1)

Use the following code; I think it will help you:

SELECT COUNT(DISTINCT CityID) AS CityID,
COUNT(DISTINCT CountryID) AS CountryID,
COUNT(DISTINCT TagID) AS TagID
FROM test GROUP BY UserID

The result will look like this (one row per user):

CityID   CountryID   TagID
2        3           3
1        2           1
1        2           2
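
The per-user counts above still need one more aggregation step to reach the single-row result the question asks for. A minimal sketch, assuming the same table name test used in this answer; the cnt_* aliases are made up for illustration:

```sql
-- Sketch: wrap the per-user distinct counts in an outer query that
-- counts how many users have more than one distinct value per column.
select  count(case when cnt_city    > 1 then 1 end) as different_cityid
       ,count(case when cnt_country > 1 then 1 end) as different_countryid
       ,count(case when cnt_tag     > 1 then 1 end) as different_tagid
from   (select   count(distinct CityID)    as cnt_city     -- hypothetical alias
                ,count(distinct CountryID) as cnt_country  -- hypothetical alias
                ,count(distinct TagID)     as cnt_tag      -- hypothetical alias
        from     test
        group by UserID
       ) t
;
```

On the sample data from the question this yields 1, 3 and 2, which are the numbers requested there.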

Regards, Vinu

Answer 2: (score: 1)

select   uid, cid, count(c), count(g)
from    (select  cid, uid
                ,count(coid)  over (partition by cid, uid)   as c
                ,count(tagid) over (partition by cid, tagid) as g
         from    citydata
        ) e
group by cid, uid;

Here uid = UserID, cid = CityID, coid = CountryID and tagid = TagID.

Total MapReduce CPU Time Spent: 0 msec
OK
uid      cid   coid   tagid
100000   1     1      1
100001   1     2      2
100002   1     2      2
100000   2     2      2
Time taken: 3.865 seconds, Fetched: 4 row(s)

This is based on the user ID. I hope it helps.