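Pig: Sum column b across rows that share the same column a

Time: 2014-05-22 22:12:31

Tags: apache-pig

I am trying to count the number of tweets with a particular hashtag over a period of time, but I get an error when I try to use the built-in SUM function.

Example: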

  data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int,   year:int, month:int, day:int, hour:int, minute:int, second:int);
  NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';   
   NBLNabilVoto_group = GROUP NBLNabilVoto by count;
   X = FOREACH NBLNabilVoto GENERATE group, SUM(data.count); 

Error:

<line 22, column 47> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

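3 answers:

Answer 0: (score: 0)

First load the data, then filter it down to the time interval you want to process. Group the records by hashtag, and use the COUNT() function to count the tweets for each hashtag.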
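The steps in this answer can be sketched roughly as follows. This is only an illustration that reuses the schema from the question; the aliases `in_range`, `by_tag`, and `tweets_per_tag` are made up, and the interval (May 2014) is a placeholder, since the question does not say which period is meant.

```pig
-- Load the tab-separated file with the schema given in the question.
data = LOAD 'tweets_2.csv' USING PigStorage('\t')
       AS (date:float, hashtag:chararray, count:int,
           year:int, month:int, day:int, hour:int, minute:int, second:int);

-- Keep only the interval of interest (assumed values, adjust as needed).
in_range = FILTER data BY year == 2014 AND month == 5;

-- One group per hashtag; the bag of matching rows is named after
-- the grouped alias, i.e. `in_range`.
by_tag = GROUP in_range BY hashtag;

-- COUNT returns the number of tuples in each bag
-- (tuples whose first field is null are skipped).
tweets_per_tag = FOREACH by_tag GENERATE group AS hashtag, COUNT(in_range) AS tweets;

DUMP tweets_per_tag;
```

If each input row already represents several tweets (the `count` column), SUM over that column would be the better fit than COUNT.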

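Answer 1: (score: 0)

I'm not sure the code does what you think or want it to do, but the error you are getting is because you are applying SUM to the wrong thing. You need to do this: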

X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);

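After the GROUP, NBLNabilVoto_count is the name of the bag of tuples inside each group, so that is what SUM has to project from.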
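To see why the bag name matters, it can help to inspect the schema that GROUP produces. The sketch below assumes the question's schema and aliases; `total` is an illustrative field name. In the question, `data.count` names the original relation rather than the bag inside the group, so SUM does not receive a bag, which is probably why Pig reports that it cannot infer a matching function.

```pig
-- Load and filter as in the question.
data = LOAD 'tweets_2.csv' USING PigStorage('\t')
       AS (date:float, hashtag:chararray, count:int,
           year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';

-- Group the filtered relation (not an undefined alias).
NBLNabilVoto_group = GROUP NBLNabilVoto_count BY count;

-- Print the schema: expect a `group` field plus a bag
-- named `NBLNabilVoto_count` holding the original rows.
DESCRIBE NBLNabilVoto_group;

-- Project the `count` column out of that bag and sum it per group.
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count) AS total;

DUMP X;
```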

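Answer 2: (score: 0)

I think you are passing the wrong relation to SUM; use NBLNabilVoto_count in place of data. I also wonder why you group by count.

If you want to count all of your tweets with the hashtag NBLNabilVoto, I think the code should look like this: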

  data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float, hashtag:chararray, count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
  -- keep only the rows for the hashtag of interest
  NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
  -- put every remaining row into a single group
  NBLNabilVoto_group = GROUP NBLNabilVoto_count ALL;
  -- sum the count column of the bag named after the grouped relation
  X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);
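For the more general case in the title (summing column b over rows that share column a), a minimal sketch might group by hashtag instead of filtering for one value. It assumes the same schema as the question; `by_tag`, `totals`, and `total` are illustrative names.

```pig
-- Load the tab-separated file with the schema given in the question.
data = LOAD 'tweets_2.csv' USING PigStorage('\t')
       AS (date:float, hashtag:chararray, count:int,
           year:int, month:int, day:int, hour:int, minute:int, second:int);

-- One group per distinct hashtag; the bag in each group is named `data`
-- because `data` is the alias being grouped.
by_tag = GROUP data BY hashtag;

-- Here `data.count` is valid: it projects `count` from the bag inside
-- each group, and SUM adds those values up per hashtag.
totals = FOREACH by_tag GENERATE group AS hashtag, SUM(data.count) AS total;

DUMP totals;
```

This yields one total per hashtag in a single pass, so no per-hashtag FILTER is needed.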