How do I get DISTINCT values for a group of fields in Pig?

Asked: 2017-03-01 20:23:13

Tags: elasticsearch apache-pig

Is it possible to get the following output in Pig? Can I GROUP on the first and second fields and then apply DISTINCT to the third field?

For example, I have this input data:

12345|9658965|52145
12345|9658965|52145
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585



I want output like this:

    12345|9658965|52145
    23456|8541232|96589
    23456|8541232|96585

2 answers:

Answer 0 (score: 1)

Method 1: Use DISTINCT

Reference: http://pig.apache.org/docs/r0.12.0/basic.html#distinct

The DISTINCT operator should help here, since it removes duplicate tuples from a relation:

test = LOAD 'test.csv' USING PigStorage('|');
distinct_recs = DISTINCT test;
DUMP distinct_recs;

Method 2: GROUP BY all fields

test = LOAD 'test.csv' USING PigStorage('|');
grp_all_fields = GROUP test BY ($0,$1,$2);
uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
DUMP uniq_recs;

Both approaches produce the expected output for the input data shared in the question.
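The approach the question itself describes (GROUP on the first two fields, then DISTINCT on the third) can also be written with a nested FOREACH. The following is only a minimal sketch: the field names `f1`, `f2`, `f3` and the `chararray` types are assumptions for illustration, since the original post does not declare a schema.

```pig
-- Hypothetical schema: f1, f2, f3 are illustrative names, not from the original post.
data = LOAD 'test.csv' USING PigStorage('|')
       AS (f1:chararray, f2:chararray, f3:chararray);

-- One group per distinct (f1, f2) pair.
grouped = GROUP data BY (f1, f2);

-- Inside each group, keep only the distinct values of the third field.
result = FOREACH grouped {
    f3_vals = data.f3;
    uniq_f3 = DISTINCT f3_vals;
    GENERATE FLATTEN(group) AS (f1, f2), FLATTEN(uniq_f3) AS f3;
};

DUMP result;
```

For the sample input this should give the same three rows as the two methods above, although the row order is not guaranteed. Because all three fields take part in the result, a plain DISTINCT on the whole relation is the simpler choice here; the nested form is mainly useful when other per-group values are needed as well.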

Answer 1 (score: 0)

Try this; it is very similar:

A = LOAD 'test.csv' USING PigStorage('|') AS (a1, a2, a3);
unique = FOREACH (GROUP A BY a3) {
    b = A.(a1, a2);
    s = DISTINCT b;
    GENERATE FLATTEN(s), group AS a4;
};