Reading values from a bag of tuples in Pig

Time: 2017-03-04 18:54:18

Tags: tuples apache-pig

My UDF output is:

  

Sample records:

({(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,5),(Todd,10),(Todd,20),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10)})

     

({(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,5),(Jon,10),(Jon,20),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,5),(Jon,20),(Jon,1)})

Schema of the UDF: name:chararray (a single column)

Now I want to read this bag of tuples and generate the following output:

Todd,240
Jon,422

I stored the UDF output in a temporary file and read it back with a different schema:

D = LOAD '/home/training/pig/pig/UDFdata.txt' AS (B: bag {T: tuple(name:chararray, denom:int)});

After that, I tried to compute the sum with a FOREACH, referring to the fields with dot notation:

X = foreach D generate B.T.name,SUM(B.T.denom);
  

2017-03-04 13:52:59,507 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1128: Cannot find field T in name:chararray,denom:int Details at logfile: /home/training/pig_1488648405070.log

Can you tell me how to compute it? I am new to Apache Pig, so I am not sure how to iterate over a bag of tuples and find the sum.

1 answer:

Answer 0 (score: 0)

Group the dataset on name before performing the SUM.

First, FLATTEN the bag so that the GROUP can be applied:

flattened = FOREACH D GENERATE FLATTEN(B);

dump flattened;
...
(Todd,10)
(Todd,10)
(Jon,1)
(Jon,1)
....

Then GROUP them by name:

grouped = GROUP flattened BY name;

dump grouped;
(Jon,{(Jon,1),(Jon,20),(Jon,5),(Jon,10),(Jon,10),(Jon,10),(Jon,10),(Jon,20),(Jon,10),(Jon,5),(Jon,1),(Jon,1),(Jon,1),(Jon,1),(Jon,1)})
(Todd,{(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,10),(Todd,20),(Todd,10),(Todd,5),(Todd,1),(Todd,1),(Todd,1),(Todd,1),(Todd,1)})

Finally, apply SUM() to the result:

final_sum = FOREACH grouped GENERATE group, SUM(flattened.denom);

dump final_sum;
(Jon,106)
(Todd,100)
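
Put together, the three steps above can be sketched as a single script. This is a minimal sketch, assuming the file path and schema from the question; the `AS (name:chararray, denom:int)` clause on the FLATTEN and the `total` alias are illustrative additions, not part of the original answer.

```pig
-- Path and schema are taken from the question.
D = LOAD '/home/training/pig/pig/UDFdata.txt'
    AS (B: bag {T: tuple(name:chararray, denom:int)});

-- Unnest the bag: one row per (name, denom) pair.
-- The AS clause (an illustrative addition) names the flattened fields explicitly.
flattened = FOREACH D GENERATE FLATTEN(B) AS (name:chararray, denom:int);

-- One row per distinct name, carrying a bag of that name's pairs.
grouped = GROUP flattened BY name;

-- Project denom out of each group's bag and sum it.
-- 'total' is a hypothetical alias chosen for readability.
final_sum = FOREACH grouped GENERATE group AS name, SUM(flattened.denom) AS total;

DUMP final_sum;
```

As for the error in the question, it most likely arises because `T` is only the alias of the tuple inside the bag, not a field, so `B.T.name` has nothing to resolve against; fields of a bag's tuples are projected directly, as in `B.denom`. Something like `FOREACH D GENERATE SUM(B.denom)` should therefore sum each input record on its own, but it would not merge records that share a name, which is why the FLATTEN and GROUP route is used here.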