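Counting task in Pig Latin

Time: 2013-07-23 16:27:25

Tags: apache-pig

Suppose I have a list of couples (id, value) and a list of potentialIDs.

For each of the potentialIDs, I want to count the number of times that ID appears in the first list.

E.g.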

couples:
1 a
1 x
2 y

potentialIDs:
1
2
3

Result:
1 2
2 1
3 0

I tried to do this in PigLatin, but it does not seem trivial.

Can you give me some hints?

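1 Answer:

Answer 0: (score: 1)

The general plan is: group couples by id and do a COUNT, then do a LEFT join of potentialIDs with the output of the COUNT. From there you can format the result as needed. The code below explains in more detail how to do this.

NOTE: If you need me to go into more detail, just let me know, but I think the comments should explain what is going on pretty well.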
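The script that follows takes the couples and potentialIDs relations as given. As a minimal sketch of how they might be loaded, assuming space-delimited text files (the file names couples.txt and potentialIDs.txt are hypothetical, and the chararray types are chosen only to match the schemas shown in the comments):

```pig
-- Hypothetical input paths; adjust the paths and the delimiter to your data.
-- Each line of couples.txt is assumed to look like "1 a".
couples = LOAD 'couples.txt' USING PigStorage(' ')
          AS (id:chararray, value:chararray);

-- Each line of potentialIDs.txt is assumed to hold a single id.
potentialIDs = LOAD 'potentialIDs.txt' USING PigStorage(' ')
               AS (id:chararray);
```

After the rest of the script has run, DESCRIBE C; and DUMP C; can be used to inspect the schema and the result.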

-- B holds the number of occurrences of each id in couples
B = FOREACH (GROUP couples BY id) 
    -- Output and schema of the group is:
    -- {group: chararray,couples: {(id: chararray,value: chararray)}}
    -- (1,{(1,a),(1,x)})
    -- (2,{(2,y)})

    -- COUNT(couples) counts the number of tuples in the bag
    GENERATE group AS id, COUNT(couples) AS count ;

-- Now we want to do a LEFT join on potentialIDs and B since it will
-- create nulls for IDs that appear in potentialIDs, but not in B
C = FOREACH (JOIN potentialIDs BY id LEFT, B BY id) 
    -- The output and schema for the join is:
    -- {potentialIDs::id: chararray,B::id: chararray,B::count: long}
    -- (1,1,2)
    -- (2,2,1)
    -- (3,,)

    -- Now we pull out only one ID, and convert any NULLs in count to 0s
    GENERATE potentialIDs::id, (B::count is NULL?0:B::count) AS count ;

The schema and output for C are:

C: {potentialIDs::id: chararray,count: long}
(1,2)
(2,1)
(3,0)

If you don't want the disambiguate operator (::) in the schema of C, you can just change the GENERATE line to:

GENERATE potentialIDs::id AS id, (B::count is NULL?0:B::count) AS count ;