Finding the average by joining two datasets

Time: 2015-01-06 03:35:17

Tags: apache-pig inner-join

I have two datasets:

EmployeeDetail(data set 1):- 
   id  
   name
   gender
   location 

SalaryDetail(data set 2):-
   id
   salary

I need to join the two and find the average salary of males and females in each location. So I tried the following code.

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as 
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as 
(id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;                                                                         
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation { 
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group, 
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};

But it gives the error below:

<line 6, column 22>  Syntax error, unexpected symbol at or near 
'JoinedEmpDetail'

Can anyone help me see where I am making a mistake, or how to do this correctly?

To make my requirement clearer, here are some sample datasets.

EmpDetail.txt

1   Biswa   Male    Bangalore
12  Bratati Mahapatra   Female  Chennai
2   Bibhu kalyan    Male    Bangalore
3   Chinta  Male    Mumbai
10  Amrit Anand Male    Bangalore
11  Sateesh panda   Male    Bangalore
4   Kirti Kumar Male    Mumbai
6   Shruthi Female  Chennai
7   Vijay   Male    Chennai
5   Bibhu   Male    Chennai
9   Bratati  Mohanty    Female  Bangalore
8   Rupa Mahapatra  Female  Bangalore
13  Salini  Female  Mumbai
14  Priyanka Chopra Female  Mumbai

EmpSalary.txt

1   10000
12  12000
2   15900
3   9000
10  8000
11  13400
4   7600
6   22000
7   17000
5   16800
9   9800
8   10000
13  11000
14  12500

The final result I need is:

Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>

1 Answer:

Answer 0: (score: 1)

You can solve this with a simple FOREACH statement, so there is no need for a nested FOREACH.

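The GROUP command does not work inside a nested FOREACH; this is a restriction in Pig. Only a few operators are allowed inside a nested FOREACH (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY). Grouping by the (location, gender) pair up front gives one group per combination, so a plain FOREACH can compute each average.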

Can you change your script as follows?

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;

Output:

(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)
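
As a sanity check on the figures above, the same logic (inner join on id, group by the (location, gender) pair, average the salary) can be sketched in plain Python. This is only an illustrative cross-check, not Pig code: the function name `average_salary_by_location_and_gender` and the in-memory lists are assumptions made for this sketch, the rows are copied from the sample files in the question, and each id is assumed to appear at most once in the salary data.

```python
from collections import defaultdict

# Rows copied from EmpDetail.txt in the question: (id, name, gender, location)
EMP_DETAIL = [
    (1, "Biswa", "Male", "Bangalore"),
    (12, "Bratati Mahapatra", "Female", "Chennai"),
    (2, "Bibhu kalyan", "Male", "Bangalore"),
    (3, "Chinta", "Male", "Mumbai"),
    (10, "Amrit Anand", "Male", "Bangalore"),
    (11, "Sateesh panda", "Male", "Bangalore"),
    (4, "Kirti Kumar", "Male", "Mumbai"),
    (6, "Shruthi", "Female", "Chennai"),
    (7, "Vijay", "Male", "Chennai"),
    (5, "Bibhu", "Male", "Chennai"),
    (9, "Bratati  Mohanty", "Female", "Bangalore"),
    (8, "Rupa Mahapatra", "Female", "Bangalore"),
    (13, "Salini", "Female", "Mumbai"),
    (14, "Priyanka Chopra", "Female", "Mumbai"),
]

# Rows copied from EmpSalary.txt in the question: (id, salary)
EMP_SALARY = [
    (1, 10000.0), (12, 12000.0), (2, 15900.0), (3, 9000.0),
    (10, 8000.0), (11, 13400.0), (4, 7600.0), (6, 22000.0),
    (7, 17000.0), (5, 16800.0), (9, 9800.0), (8, 10000.0),
    (13, 11000.0), (14, 12500.0),
]


def average_salary_by_location_and_gender(emp_detail, emp_salary):
    """Inner-join on id, group by (location, gender), and average the salary."""
    # Assumption: each id occurs at most once in emp_salary.
    salary_by_id = dict(emp_salary)
    salaries = defaultdict(list)
    for emp_id, _name, gender, location in emp_detail:
        if emp_id in salary_by_id:  # inner join: skip ids with no salary row
            salaries[(location, gender)].append(salary_by_id[emp_id])
    return {key: sum(vals) / len(vals) for key, vals in salaries.items()}


if __name__ == "__main__":
    result = average_salary_by_location_and_gender(EMP_DETAIL, EMP_SALARY)
    for (location, gender), avg in sorted(result.items()):
        print(location, gender, avg)
```

Keying the dictionary on the (location, gender) tuple plays the same role as `group JoinedEmpDetail by (location,gender)` in the script: one bucket per combination, so a single pass is enough and no nested grouping is needed.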