Counting distinct values within a group using Pig

Date: 2016-01-19 01:24:00

Tags: apache-pig

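In general terms, I want to group my data and then count the unique values of a field. Specifically, for the data below, I want to group by "category" and "year", and then count how often each unique value of "food" occurs.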

category,id,mydate,mystore,food    
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple

This is as far as I have got. It just selects the values and uses some neat Pig date functions:

a = load '$input' using PigStorage(',') as (category:chararray,id:chararray,mydate:chararray,mystore:chararray,food:chararray);
b = foreach a generate category, id, ToDate(mydate,'yyyy-MM-dd HH:mm:ss') as myDt:DateTime, mystore,food;


c = foreach b generate category, GetYear(myDt) as year:int, mystore,food;      
dump c;

The output of alias 'c' is:

(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2015,store1,milk)
(catB,2014,store2,milk)
(catB,2014,store2,apple)

What I ultimately want is:

catA, 2014, {(apple, 2), (milk, 2)} 
catA, 2015, {(milk, 1)} 
catB, 2014, {(apple, 1), (milk, 1)} 

I have seen some examples that produce value counts, but grouping by category and year as well is what trips me up.

2 answers:

Answer 0: (score: 1)

Yes, you can use a nested FOREACH after grouping. Inside the nested FOREACH, apply DISTINCT to food and then count the result.

The following code may help.

Input:

category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple

Pig script:

-- Load the raw CSV; every field is read as a chararray.
list = LOAD 'user/cloudera/apple.txt' USING PigStorage(',') AS (category:chararray, id:chararray, mydate:chararray, my_store:chararray, food:chararray);

-- The first four characters of mydate are the year.
list_each = FOREACH list GENERATE category, SUBSTRING(mydate,0,4) AS my_year, my_store, food;

list_grp = GROUP list_each BY (category, my_year);

-- For each (category, year) group, count how many distinct foods it contains.
list_nested_each = FOREACH list_grp {
    list_inner_each = FOREACH list_each GENERATE food;
    list_inner_dist = DISTINCT list_inner_each;
    GENERATE FLATTEN(group) AS (category, my_year), COUNT(list_inner_dist) AS no_of_uniq_foods;
};

DUMP list_nested_each;

Answer 1: (score: 0)

Appending the following to the code in the question produces a count for each food within each category and year:

d = group c by (category, year, food);
e = foreach d generate FLATTEN(group), COUNT(c) as count;

The key is to include "food" in the grouping key. Interesting. Any other insights are welcome.
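
The grouping above yields one flat row per (category, year, food), whereas the question asks for one row per (category, year) holding a bag of (food, count) pairs. A minimal sketch of one way to get that shape is to group a second time and project the two fields into a bag. The aliases f and g and the field names cnt and food_counts are illustrative and do not come from either answer, and the sketch assumes the input file has no header line, since ToDate would not be able to parse one.

```pig
-- Sketch only: f, g, cnt and food_counts are illustrative names.
-- Assumes '$input' contains data rows only (no header line).
a = LOAD '$input' USING PigStorage(',') AS (category:chararray, id:chararray, mydate:chararray, mystore:chararray, food:chararray);

-- Derive the year from the timestamp string.
b = FOREACH a GENERATE category, GetYear(ToDate(mydate, 'yyyy-MM-dd HH:mm:ss')) AS year:int, food;

-- First pass: count rows per (category, year, food).
d = GROUP b BY (category, year, food);
e = FOREACH d GENERATE FLATTEN(group) AS (category, year, food), COUNT(b) AS cnt;

-- Second pass: collect the (food, cnt) pairs into one bag per (category, year).
f = GROUP e BY (category, year);
g = FOREACH f GENERATE FLATTEN(group) AS (category, year), e.(food, cnt) AS food_counts;

DUMP g;
```

The order of the tuples inside each bag is not guaranteed. If it matters, a nested FOREACH with an ORDER on the bag could be used instead of the plain projection.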