Group and SUM with MAX()

Time: 2019-04-07 21:44:51

Tags: hadoop apache-pig hortonworks-data-platform

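I have a dataset with Year, Country, Sex, and Population columns. I need to find the country with the highest population in the most recent year.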

a = group data by Country;
b = foreach a generate flatten(group), MAX(data.Year);
-- Up to here I am able to get the country and its latest year.
-- SUM on data.Population is giving errors.

I need the result in the following order: country, year, and population (for that year only).

1 answer:

Answer 0 (score: 0)

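After getting the maximum year for each country, join that result back to the originally loaded relation, then group by country and year to sum the population.

Assume you have loaded the data into a relation named data. Join data with b on country and year.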

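-- Assumed schema and delimiter; adjust the columns to match your file.
data = load 'data_file' using PigStorage(',') as (country:chararray, year:int, population:int);
a = group data by country;
-- Latest year per country. Field names are case-sensitive, so use year, not Year.
b = foreach a generate flatten(group) as country, MAX(data.year) as year;
-- Keep only the rows that belong to each country's latest year.
c = join data by (country, year), b by (country, year);
-- After a join, fields are disambiguated with ::, not with a dot.
c1 = foreach c generate data::country as country, data::year as year, data::population as population;
d = group c1 by (country, year);
-- The bag inside d is named c1, so SUM must reference c1.population.
e = foreach d generate FLATTEN(group) as (country, year), SUM(c1.population) as population;
dump e;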
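
The question asks for the single country with the highest population, while e holds one row per country. A minimal sketch of that last step, assuming e has the layout (country, year, summed population) and using the relation names f and g, which are made up here for illustration:

```
-- Sketch only: assumes relation e from the script above, with the
-- summed population as its third field (position $2).
f = order e by $2 desc;   -- largest population first
g = limit f 1;            -- keep only the top row
dump g;
```

The field is referenced by position so the snippet does not depend on what the summed column is called. Note that e uses each country's own latest year, so if countries stop reporting in different years the top row may compare populations from different years.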