How do I filter an (a,b) relation by (b,a)?

Time: 2013-08-29 16:13:47

Tags: hadoop apache-pig

I have a generic relation like this:

DUMP A;
(a, b)
(a, c)
(a, d)
(b, a)
(d, a)
(d, b)

Notice that (a,b) has a mirrored partner (b,a), but (d,b) has none. I want to filter those "unpaired" tuples out.

The final result should be:

DUMP R; 
(a, b)
(a, d)
(b, a)
(d, a)

How can I write this in Pig?

I was able to solve it with the following code, but the CROSS operation is too expensive:

-- Copy of A (positional references, since A's field names are not shown)
A_cp = FOREACH A GENERATE $0, $1;
X = CROSS A, A_cp;
F = FILTER X BY ($0 == $3 AND $1 == $2);
R = FOREACH F GENERATE $0, $1;

1 answer:

Answer 0 (score: 1):

Here is the output of DESCRIBE A ; DUMP A ; on my data:

A: {first: chararray,second: chararray}
(a,b)
(a,c)
(a,d)
(b,a)
(d,a)
(d,b)

Here is one way to solve it:

A = LOAD 'foo.in' AS (first:chararray, second:chararray) ;
-- A relation can't be joined with itself directly, so we duplicate A
A2 = FOREACH A GENERATE * ;

-- Join A with A2 so that the rows look like (b,a,a,c), etc.
B = JOIN A BY second, A2 BY first ; 

-- We only want rows where the first field equals the last field, e.g. (b,a,a,b).
C = FOREACH (FILTER B BY A::first == A2::second)
    -- Now we project out just one side of the pair.
    GENERATE A::first AS first, A::second AS second ;

Output:

C: {first: chararray,second: chararray}
(b,a)
(d,a)
(a,b)
(a,d)

Update: as WinnieNicklaus pointed out, this can be shortened to:

B = FOREACH (JOIN A BY (first, second), A2 BY (second, first))
    GENERATE A::first AS first, A::second AS second ;
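
For reference, the shortened version can be put together as one complete script. This is only a sketch under a few assumptions: 'foo.in' is taken to be a tab-delimited, two-column file as in the LOAD above, and the DISTINCT step and the alias R are additions for illustration, not part of the original answer. An inner join emits one row per matching pair, so duplicate tuples in the input would otherwise show up repeated in the result.

```pig
-- Assumption: 'foo.in' is a tab-delimited file with two columns.
A = LOAD 'foo.in' AS (first:chararray, second:chararray);

-- A second alias over the same data, needed for the self-join.
A2 = FOREACH A GENERATE *;

-- Keep (x, y) from A only when A2 contains the mirrored tuple (y, x).
B = FOREACH (JOIN A BY (first, second), A2 BY (second, first))
    GENERATE A::first AS first, A::second AS second;

-- Optional (not in the original answer): collapse repeats caused by
-- duplicate tuples in the input.
R = DISTINCT B;

DUMP R;
```

Note that a tuple such as (a,a) is its own mirror image, so it would match itself in the join and be kept. If that is not wanted, a FILTER on first != second before the join should exclude such tuples.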