How do I combine two Pig statements into one?

Time: 2014-04-29 18:37:01

Tags: apache-pig

This two-stage Pig processing works:

my_out = foreach (group my_in by id) {
  grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    grouped as grouped;
};
my_out1 = foreach my_out {
  keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
  generate id, domains, keywords;
};

However, when I combine the two statements:

my_out = foreach (foreach (group my_in by id) {
  grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    grouped as grouped;
  }) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
  };

I get this error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5.

My questions are:

  1. How can I avoid this error?
  2. Does what I am trying to do even make sense? Even if I manage to do it, will it save me an MR pass?

1 answer:

Answer 0: (score: 2)

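In general, Pig's ability to parse complex nested expressions is unreliable. Another common error when the nesting gets too complicated for it to handle is ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""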

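I often try to do this too, to avoid having to come up with a bunch of names for aliases that mean nothing beyond being an intermediate step in a calculation. But sometimes you just can't get away with it. My guess is that nesting a nested foreach inside another foreach is not supported. In your case, though, it looks like the first nested foreach isn't necessary. Try this: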

my_out = foreach (foreach (group my_in by id)
  generate
    group as id,
    CountEach(my_in.domain) as domains,
    BagGroup(my_in.(keyword,weight),my_in.keyword) as grouped
  ) {
    keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
    generate id, domains, keywords;
  };

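As for your second question, this has no effect on the final MR plan. It is purely a matter of how Pig parses your script; grouping the statements this way does not change the map-reduce logic.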
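
One way to check this for yourself is Pig's explain operator, which prints the logical, physical, and MapReduce plans for an alias. A minimal sketch, assuming the aliases from the question are already defined in the session (my_out1 from the two-step script, my_out from the single-statement script), with each script run separately:

```
-- Two-step version: run after defining my_out and my_out1 as in the question
explain my_out1;

-- Single-statement version (in a separate session): run after defining my_out
explain my_out;
```

If the MapReduce plan sections of the two outputs list the same jobs, the rewrite has not saved a pass.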