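Apache Pig - Set the current row's date to the next record's date

Time: 2016-08-30 13:47:06

Tags: apache-pig

In Pig, I need to set AVAIL_UNTIL to the AVAIL_SINCE of the next record for the same ID, and default it to 9999-12-31 for the last record of each ID. I first sorted the data by ID and then by AVAIL_SINCE, but after that I got stuck. I think I may need the Over/Stitch or lead/lag functions, but I am not sure. Any help would be greatly appreciated!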

 Input Data:

 ID       AVAIL_SINCE    AVAIL_UNTIL
 1        19-Jan-00      31-Dec-99
 1        11-Jun-00      31-Dec-99
 1        4-Aug-00       31-Dec-99
 1        19-May-01      31-Dec-99 
 2        5-May-02       31-Dec-99 
 2        8-Apr-03       31-Dec-99 
 3        10-Jun-00      31-Dec-99 
 3        31-Oct-00      31-Dec-99 
 3        29-Dec-00      31-Dec-99  

 Required Result:

 ID       AVAIL_SINCE    AVAIL_UNTIL
 1        19-Jan-00      11-Jun-00
 1        11-Jun-00      4-Aug-00
 1        4-Aug-00       19-May-01
 1        19-May-01      31-Dec-99
 2        5-May-02       8-Apr-03 
 2        8-Apr-03       31-Dec-99
 3        10-Jun-00      31-Oct-00
 3        31-Oct-00      29-Dec-00
 3        29-Dec-00      31-Dec-99

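2 answers:

Answer 0: (score: 0)

You have to load the data twice and rank each copy to generate a unique row number. Filter the top record out of the second dataset and rank it again, then join the two datasets on the row number. Finally, take the last record from the first dataset and union it with the joined dataset. See below.

Script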

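-- Load the same file twice so it can be joined to itself.
A = LOAD 'test9.txt' USING PigStorage('\t') as (A1:int,A2:chararray,A3:chararray);
B = LOAD 'test9.txt' USING PigStorage('\t') as (B1:int,B2:chararray,B3:chararray);
-- RANK without BY prepends a sequential row number (rank_A / rank_B).
RankA = rank A;
RankB = rank B;

-- Drop the first row of B and re-rank, so row n of BB_New is row n+1 of A.
BB = FILTER RankB by (rank_B > 1);
BB_New = rank BB;

-- Pair each row with the following row and use its AVAIL_SINCE as AVAIL_UNTIL.
-- The ranks are global, so the last row of an ID is paired with the first row of the next ID.
AB = JOIN RankA by rank_A,BB_New by rank_BB;
AB_ALL = foreach AB GENERATE RankA::A1,RankA::A2,BB_New::B2;
-- The very last row has no successor; keep it with its original AVAIL_UNTIL.
A_Order = ORDER RankA by rank_A desc;
A_Last = LIMIT A_Order 1;
A_Fields = foreach A_Last generate $1,$2,$3;

FINAL = UNION A_Fields,AB_ALL;
FINALORDER = ORDER FINAL BY $0;
DUMP FINALORDER;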


Answer 1: (score: 0)

I will extend @inuistive_mind's solution to get the exact result.

Steps that need to be added

A = LOAD 'test9.txt' USING PigStorage('\t') as (A1:int,A2:chararray,A3:chararray);
B = LOAD 'test9.txt' USING PigStorage('\t') as (B1:int,B2:chararray,B3:chararray);
RankA = rank A;
RankB = rank B;

BB = FILTER RankB by (rank_B > 1);
BB_New = rank BB;

AB = JOIN RankA by rank_A,BB_New by rank_BB;
AB_ALL = foreach AB GENERATE RankA::A1,RankA::A2,BB_New::B2;
A_Order = ORDER RankA by rank_A desc;
A_Last = LIMIT A_Order 1;
A_Fields = foreach A_Last generate $1,$2,$3;

FINAL = UNION A_Fields,AB_ALL;
FINALORDER = ORDER FINAL BY $0;

Hope this hint helps you reach the final result.
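
To check a Pig script's output against the required result, the per-ID "next record" logic can be pinned down in a few lines of plain Python. This is only a reference sketch, not Pig and not the method used in either answer. The function name `fill_avail_until` and the `(id, avail_since)` tuple layout are made up for illustration, and the `%d-%b-%y` date format is an assumption based on the sample data (it also assumes English month names in the current locale).

```python
from datetime import datetime
from itertools import groupby

# Sentinel the sample data uses for the open-ended date 9999-12-31.
DEFAULT_UNTIL = "31-Dec-99"


def parse_date(text):
    # Assumed format of AVAIL_SINCE in the sample data, e.g. '19-Jan-00'.
    return datetime.strptime(text, "%d-%b-%y")


def fill_avail_until(rows):
    """Take (id, avail_since) tuples and return (id, avail_since, avail_until)."""
    # Sort by ID, then chronologically by AVAIL_SINCE.
    ordered = sorted(rows, key=lambda r: (r[0], parse_date(r[1])))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        # Shift AVAIL_SINCE up by one row within the ID; the last row gets the default.
        next_since = [r[1] for r in group[1:]] + [DEFAULT_UNTIL]
        for (rec_id, since), until in zip(group, next_since):
            result.append((rec_id, since, until))
    return result


rows = [
    (1, "19-Jan-00"), (1, "11-Jun-00"), (1, "4-Aug-00"), (1, "19-May-01"),
    (2, "5-May-02"), (2, "8-Apr-03"),
    (3, "10-Jun-00"), (3, "31-Oct-00"), (3, "29-Dec-00"),
]
result = fill_avail_until(rows)
```

The key point is that the shift happens inside each ID group, so the last row of one ID never picks up a date from the next ID.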