Hive: MAX function per bucket

Time: 2018-08-25 15:47:02

Tags: hive hiveql

I have a table in Hive with the following structure:

create table if not exists cdp_compl_status
(
EmpNo INT,
RoleCapability STRING,
EmpPUCode STRING,
SBUCode STRING,
CertificationCode STRING,
CertificationTitle STRING,
Competency STRING,
Certification_Type STRING,
Certification_Group STRING,
Contact_Based_Program_Y_N STRING,
ExamDate DATE,
Onsite_Offshore STRING,
AttendedStatus STRING,
Marks INT,
Result STRING,
Status STRING,
txtPlanCategory STRING,
SkillID1 INT,
Complexity STRING
)
CLUSTERED BY (Marks) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES('created on' = '12 Aug');

Now I want to query MAX(MARKS) for each bucket in the table. If I run:

SELECT MAX(MARKS) from cdp_compl_status;  

it returns the maximum marks across the whole table. Is there a way to find MAX(MARKS) within each bucket?

2 Answers:

Answer 0: (score: 2)

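Since you have bucketed the table into 5 buckets, rows are assigned to buckets by hashing the bucketing column and taking the result modulo the number of buckets. In Hive's classic bucketing scheme the hash of an INT is the value itself, so for non-negative marks:

marks%5==0 goes into the first bucket
marks%5==1 goes into the second bucket
marks%5==2 goes into the third bucket
marks%5==3 goes into the fourth bucket
marks%5==4 goes into the fifth bucket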

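So you need to write 5 queries like this, one per remainder:

Select max(marks) from cdp_compl_status where marks%5=0;

This returns the maximum marks in the first bucket; change the remainder to 1 through 4 for the other buckets.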

I think that should do it.
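As a sketch, the five queries could also be collapsed into one by grouping on the same modulo expression. This assumes, as this answer does, that non-negative Marks values land in buckets by Marks modulo 5. The aliases bucket_id and max_marks are illustrative names, not part of the original table.

```sql
-- One output row per remainder class, i.e. per assumed bucket.
-- bucket_id 0 corresponds to the first bucket, 4 to the fifth.
SELECT Marks % 5  AS bucket_id,
       MAX(Marks) AS max_marks
FROM cdp_compl_status
GROUP BY Marks % 5;
```

Rows where Marks is NULL form their own group with a NULL bucket_id.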

Answer 1: (score: 1)

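Use TABLESAMPLE, one query per bucket (here the bucketed table is named cert_comp_status_buck; substitute your own table name):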

select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 1 out of 5 on marks);

select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 2 out of 5 on marks);

select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 3 out of 5 on marks);

select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 4 out of 5 on marks);

select max(marks),min(marks),avg(marks) from cert_comp_status_buck
tablesample(bucket 5 out of 5 on marks);
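
The five statements above could also be folded into one by grouping on Hive's INPUT__FILE__NAME virtual column, which holds the path of the file each row was read from. This is only a sketch: it assumes the table was populated with bucketing in effect, so that each bucket is stored as exactly one file. The aliases are illustrative.

```sql
-- One output row per underlying data file.
-- Assumes each bucket of the table is stored as a single file.
SELECT INPUT__FILE__NAME AS bucket_file,
       MAX(marks)        AS max_marks,
       MIN(marks)        AS min_marks,
       AVG(marks)        AS avg_marks
FROM cert_comp_status_buck
GROUP BY INPUT__FILE__NAME;
```

If the table was loaded without bucketing being applied (for example with a plain file load), the files will not correspond to buckets and this query will not give per-bucket results.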