Creating a custom surrogate key in Pig

Date: 2016-04-20 09:17:56

Tags: apache-pig uniqueidentifier surrogate-key

Is there a way to create a custom surrogate key in Pig?

For example, we have the following data:

Salary City Name

20000 newyork john   
30000 sydney joseph   
60000 delhi mike   
30000 sydney joseph

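For this data, we need to create a surrogate key for each row, so that identical rows share the same key. The result should look like this: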

     Salary City Name

SCN1 20000 newyork john    
SCN2 30000 sydney joseph   
SCN3 60000 delhi mike  
SCN2 30000 sydney joseph

Can this be done instead of generating random unique keys?

Thanks in advance!

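2 answers:

Answer 0: (score: 1)

First take the DISTINCT of the data, then use RANK and CONCAT to build a custom key for each distinct row. Next, JOIN the original dataset with the distinct one. Finally, generate the required columns.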

A = LOAD 'data.txt' USING PigStorage('\t');
B = DISTINCT A;
C = RANK B;
D = FOREACH C GENERATE CONCAT('SCN',$0),$1,$2,$3;
E = JOIN A BY ($0,$1,$2),D BY ($1,$2,$3);
F = FOREACH E GENERATE $3,$0,$1,$2;
DUMP F;

Here is how it works on the sample data:

A

20000 newyork john   
30000 sydney joseph   
60000 delhi mike   
30000 sydney joseph

B

20000 newyork john   
30000 sydney joseph   
60000 delhi mike

C

1 20000 newyork john   
2 30000 sydney joseph   
3 60000 delhi mike

D

SCN1 20000 newyork john   
SCN2 30000 sydney joseph   
SCN3 60000 delhi mike

E

20000 newyork john SCN1 20000 newyork john     
30000 sydney joseph SCN2 30000 sydney joseph   
60000 delhi mike SCN3 60000 delhi mike 
30000 sydney joseph SCN2 30000 sydney joseph 

F

SCN1 20000 newyork john    
SCN2 30000 sydney joseph   
SCN3 60000 delhi mike  
SCN2 30000 sydney joseph
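
For readers who want to follow the data flow outside of Pig, the steps above can be sketched in plain Python. This is only an illustration of the logic, not part of the Pig script: the function name `add_surrogate_keys` and its `prefix` parameter are made up for this example, and it numbers distinct rows in first-seen order, whereas RANK on the output of DISTINCT numbers them in whatever order Pig returns them.

```python
def add_surrogate_keys(rows, prefix="SCN"):
    """Prefix each row with a key that identical rows share (hypothetical helper)."""
    # DISTINCT: drop duplicate rows, keeping first-seen order.
    distinct_rows = list(dict.fromkeys(rows))
    # RANK + CONCAT: number the distinct rows from 1 and build the key string.
    key_by_row = {
        row: f"{prefix}{rank}"
        for rank, row in enumerate(distinct_rows, start=1)
    }
    # JOIN + final FOREACH: look up each original row's key and put it first.
    return [(key_by_row[row],) + row for row in rows]


# Sample data from the question.
rows = [
    (20000, "newyork", "john"),
    (30000, "sydney", "joseph"),
    (60000, "delhi", "mike"),
    (30000, "sydney", "joseph"),
]

for keyed_row in add_surrogate_keys(rows):
    print(*keyed_row)
```

With the sample data, the two identical `sydney joseph` rows receive the same key. Unlike the Pig JOIN, this sketch also keeps the rows in their input order.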

Answer 1: (score: 0)

Thanks to Inquistive Mind for helping me create unique surrogate keys. Here is the Pig script I tested.

 A = LOAD '/user/root5/data3.txt' USING PigStorage(',');
 B = DISTINCT A;
 C = RANK B;
 D = FOREACH C GENERATE CONCAT('SCN',$0),$1,$2,$3;
 E = JOIN A BY ($0,$1,$2),D BY ($1,$2,$3);
 F = FOREACH E GENERATE $3, $0, $1, $2;
 DUMP F;

The output of each step is as follows:

DUMP A;
(20000,newyork,john)
(30000,sydney,joseph)
(60000,delhi,mike)
(20000,newyork,john)
(30000,sydney,mike)
(60000,delhi,mike)  

DUMP B;
(20000,newyork,john)
(30000,sydney,mike)
(30000,sydney,joseph)
(60000,delhi,mike)

DUMP C;
(1,20000,newyork,john)
(2,30000,sydney,mike)
(3,30000,sydney,joseph)
(4,60000,delhi,mike)

DUMP D;
(SCN1,20000,newyork,john)
(SCN2,30000,sydney,mike)
(SCN3,30000,sydney,joseph)
(SCN4,60000,delhi,mike)

DUMP E;
(20000,newyork,john,SCN1,20000,newyork,john)
(20000,newyork,john,SCN1,20000,newyork,john)
(30000,sydney,mike,SCN2,30000,sydney,mike)
(30000,sydney,joseph,SCN3,30000,sydney,joseph)
(60000,delhi,mike,SCN4,60000,delhi,mike)
(60000,delhi,mike,SCN4,60000,delhi,mike)

DUMP F;
(SCN1,20000,newyork,john)
(SCN1,20000,newyork,john)
(SCN2,30000,sydney,mike)
(SCN3,30000,sydney,joseph)
(SCN4,60000,delhi,mike)
(SCN4,60000,delhi,mike)