Merging lines in Pig

Time: 2014-12-27 14:13:59

Tags: apache-pig

I want to write a Pig script for the following query.

The input is:

AAA,,,
,BBB,,
,,,DDD
AAA,,,
,BBB,,
,,CCC,
,,,DDD
AAA,,,
,BBB,,
,,,DDD

The output should be:

AAA,BBB,,DDD
AAA,BBB,CCC,DDD
AAA,BBB,,DDD

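I tried the approach from "Merge two lines in Pig", but if I split the bag with BagSplit(3,$1) the output is not correct, because my output has to merge the first three lines, then the next four lines, and then the next three lines.

The input may grow, but one important point is that the last line of each group is always ,,,DDD.

Can anyone help me?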

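1 Answer:

Answer 0 (score: 0)

Your input data has to be split into groups of different lengths (3, 4, 3), so the BagSplit function will not work in this case. Can you try the approach below? The repeated TOTUPLE part of relation E could be factored out further with macros, but that would cause more confusion, so I have not optimized it for now.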

input.txt

AAA,,,
,BBB,,
,,,DDD
AAA,,,
,BBB,,
,,CCC,
,,,DDD
AAA,,,
,BBB,,
,,,DDD

PigScript:

A = LOAD 'input.txt' USING PigStorage(',') AS(f1,f2,f3,f4);
B = RANK A;
C = GROUP B ALL;
D = FOREACH C  {
                 firstRecord = FILTER B BY rank_A<=3;                /* store first 3 records*/
                 secondRecord= FILTER B BY rank_A>3 AND rank_A<=7;   /* store next 4 records */
                 thirdRecord = FILTER B BY rank_A>7;                 /* store next 3 records */
                 GENERATE firstRecord,secondRecord,thirdRecord;
                }

/* Convert each split bag (firstRecord, secondRecord and thirdRecord) into a string and replace 'null' and '_' with an empty string.
   The pattern is 'null|_' (alternation); a character class such as '[null|_]' would also strip the single letters n, u and l from the data. */
E = FOREACH D GENERATE FLATTEN(TOBAG(
                                        TOTUPLE(REPLACE(BagToString(firstRecord.f1),'null|_',''),
                                                REPLACE(BagToString(firstRecord.f2),'null|_',''),
                                                REPLACE(BagToString(firstRecord.f3),'null|_',''),
                                                REPLACE(BagToString(firstRecord.f4),'null|_','')),
                                        TOTUPLE(REPLACE(BagToString(secondRecord.f1),'null|_',''),
                                                REPLACE(BagToString(secondRecord.f2),'null|_',''),
                                                REPLACE(BagToString(secondRecord.f3),'null|_',''),
                                                REPLACE(BagToString(secondRecord.f4),'null|_','')),
                                        TOTUPLE(REPLACE(BagToString(thirdRecord.f1),'null|_',''),
                                                REPLACE(BagToString(thirdRecord.f2),'null|_',''),
                                                REPLACE(BagToString(thirdRecord.f3),'null|_',''),
                                                REPLACE(BagToString(thirdRecord.f4),'null|_',''))
                                        )
                                 );
DUMP E;

Output:

(AAA,BBB,,DDD)
(AAA,BBB,CCC,DDD)
(AAA,BBB,,DDD)
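
The script above hardcodes the rank boundaries (3 and 7), so it only fits this exact input. Since the question says every group ends with a ,,,DDD line, the grouping rule itself can be sketched as follows. This is a plain Python illustration of the logic, not Pig code; the function name merge_lines and the width parameter are hypothetical, and it assumes the lines arrive in their original order.

```python
def merge_lines(lines, width=4):
    """Merge consecutive sparse CSV lines into one record per group.

    A group is closed by a line whose last column is non-empty
    (",,,DDD" in the question). `width` is the assumed column count.
    """
    merged = []
    current = [""] * width
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Pad short lines so indexing stays safe (an assumption about the input).
        fields += [""] * (width - len(fields))
        for i in range(width):
            if fields[i]:
                current[i] = fields[i]
        # A non-empty last column marks the end of the current group.
        if fields[width - 1]:
            merged.append(",".join(current))
            current = [""] * width
    # Lines after the final terminator (an unfinished group) are dropped.
    return merged


if __name__ == "__main__":
    sample = [
        "AAA,,,", ",BBB,,", ",,,DDD",
        "AAA,,,", ",BBB,,", ",,CCC,", ",,,DDD",
        "AAA,,,", ",BBB,,", ",,,DDD",
    ]
    for row in merge_lines(sample):
        print(row)
```

The same rule could in principle be wrapped in a Pig UDF so that the group sizes no longer need to be known in advance, but that is left as an assumption here rather than a tested solution.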