Extract a specific number of records so that they satisfy certain aggregate conditions

Time: 2016-03-11 16:25:57

Tags: sql oracle

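Here I describe an abstract case, but it is similar to the one I am trying to solve right now. I know how to get a rough result using a PL/SQL block, but I wonder whether anyone can suggest a solution that uses a single select query.

Suppose we have a table t_people with thousands of records describing a group of people, with the following attributes:

  • id
  • age, number
  • height in cm, number
  • gender, varchar2 ('male' or 'female')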

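We need to extract N records so that the result set meets the following conditions:

  • 30% of the selected people are taller than 180 cm
  • 60% of the selected people are male
  • 40% of the selected people are older than 40

We can also assume that N is much smaller than the total number of rows in the table and that the problem is solvable.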
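To make the target concrete, here is a minimal sketch of a query that checks whether a candidate result set meets the three conditions. The name sample_rows and its ten inline rows are made up for illustration; in practice it would stand for whatever N records were selected from t_people.

```sql
-- Sketch only: "sample_rows" is a hypothetical stand-in for the N selected
-- records; the ten rows below are invented so the query runs on its own.
with sample_rows as (
    select 1 id, 30 age, 185 height, 'female' gender from dual union all
    select 2,  25, 190, 'female' from dual union all
    select 3,  35, 182, 'female' from dual union all
    select 4,  28, 165, 'female' from dual union all
    select 5,  50, 175, 'male'   from dual union all
    select 6,  61, 170, 'male'   from dual union all
    select 7,  45, 178, 'male'   from dual union all
    select 8,  42, 179, 'male'   from dual union all
    select 9,  33, 172, 'male'   from dual union all
    select 10, 38, 176, 'male'   from dual
)
select  count(*) n,
        -- avg() of a 0/1 flag is the share of rows where the flag is 1
        round(100 * avg(case when height > 180    then 1 else 0 end), 1) pct_taller_180,
        round(100 * avg(case when gender = 'male' then 1 else 0 end), 1) pct_male,
        round(100 * avg(case when age > 40        then 1 else 0 end), 1) pct_over_40
from    sample_rows;
```

For these ten rows the query returns n = 10 with 30, 60 and 40 in the three percentage columns, which is the distribution asked for.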

How would you suggest doing this with a single select query?

Thanks.

2 answers:

Answer 0 (score: 3)

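You can split the data into 8 groups (one per combination of the three yes/no attributes) and then take a proportional sample from each group to meet your requirements. A rough approach is to translate the conditions into group sizes, for example, for a sample of 1,000 records:

  • 300 people taller than 180, not male, not older
  • 100 people shorter, not male, not older
  • 400 people shorter, male, older
  • 200 people shorter, male, not older

Then you can solve it like this: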

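with p as (
      select p.*,
             -- number the rows inside each (is_tall, is_male, is_older) group;
             -- order by dbms_random.value instead of id for a random sample
             row_number() over (partition by is_tall, is_male, is_older order by id) as seqnum
      from (select p.*,
                   -- the flags get their own names so they do not clash with
                   -- the height and age columns already returned by p.*
                   (case when height > 180 then 1 else 0 end) as is_tall,
                   (case when gender = 'male' then 1 else 0 end) as is_male,
                   (case when age > 40 then 1 else 0 end) as is_older
            from t_people p
           ) p
     )
select p.*
from p
where (is_tall = 1 and is_male = 0 and is_older = 0 and seqnum <= 300) or
      (is_tall = 0 and is_male = 0 and is_older = 0 and seqnum <= 100) or
      (is_tall = 0 and is_male = 1 and is_older = 1 and seqnum <= 400) or
      (is_tall = 0 and is_male = 1 and is_older = 0 and seqnum <= 200);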
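The four group sizes used above are only one of many assignments that satisfy the conditions. As a hedged sanity check, the sketch below lists those four sizes in an inline table (the name quotas and the 0/1 flag column names are illustrative, not part of the original answer) and recomputes the three percentages from them.

```sql
-- Sketch: the four group sizes from the answer, written as an inline table.
-- "quotas" and its flag columns are hypothetical names used for illustration.
with quotas as (
    select 1 is_tall, 0 is_male, 0 is_older, 300 quota from dual union all
    select 0, 0, 0, 100 from dual union all
    select 0, 1, 1, 400 from dual union all
    select 0, 1, 0, 200 from dual
)
select  sum(quota) total_rows,
        -- a flag times its quota counts the rows that carry that attribute
        100 * sum(is_tall  * quota) / sum(quota) pct_tall,
        100 * sum(is_male  * quota) / sum(quota) pct_male,
        100 * sum(is_older * quota) / sum(quota) pct_older
from    quotas;
```

The result is 1000 rows in total with 30% tall, 60% male and 40% older, so the sizes match the question for N = 1000. For a different N the sizes would have to be scaled accordingly, and every group must actually contain at least that many people.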

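Alternatively, you can fill the 8 buckets evenly while keeping track of the counts along each dimension (younger/older, male/female, shorter/taller). When the first dimension reaches its quota, stop filling its buckets and continue with the 4 complementary cells. Repeat this process until you have the numbers you need.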

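Answer 1 (score: 0)

I ended up going with the first approach suggested by Gordon Linoff, with some minor modifications. I kept the original idea but introduced a few additional subqueries that specify the desired distribution of records within the groups and build a matrix holding the required record count for each group. There is also a global parameters section, whose only parameter specifies the total number of records.

The query produces quite useful results: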

with 
    people as (
        select  id,
                floor(months_between(sysdate, date_birth)/12) age,
                195 - least(floor(months_between(sysdate, date_birth)/12), 50) height,
                decode(sex, 1, 'male', 'female') gender
        from    my_people_table
        where   date_birth is not null and rownum < 100000
    ),
    params as ( /* Global params */
        select  100 rec_count   -- total record count 
        from dual
    ),
    age_groups as (     /* distribution by age */
        select  'group 1' age_group, .7 prc from dual union
        select  'group 2' age_group, .3 prc from dual  
    ),
    height_groups as ( /* distribution by height */
        select  'group 1' height_group, .6 prc from dual union
        select  'group 2' height_group, .4 prc from dual  
    ),
    genders as (       /* distribution by gender */
        select  'male'   gender, .6 prc from dual union
        select  'female' gender, .4 prc from dual  
    ),
    mx as (            /* a matrix with record counts per group */
        select  age_group, height_group, gender,
                ceil(
                    age_groups.prc * 
                    height_groups.prc * 
                    genders.prc * 
                    rec_count
                )  rec_count       
        from    age_groups, height_groups, genders, params
    ),
    xpeople as (       /* Minor transformations - groups and group counters */
        select  p.*, 
                row_number() over (
                    partition by age_group, height_group, gender
                        order by age_group, height_group, gender
                ) rec_num
        from (                             
                select  people.*,
                        case 
                            when age    <=  40 then 'group 1' 
                                               else 'group 2' 
                        end age_group,
                        case 
                            when height <= 180 then 'group 1' 
                                               else 'group 2' 
                        end height_group
                from    people
        ) p
    )
/* the resulting query uses the matrix to filter the records */    
select  xpeople.*
from    xpeople join mx 
            on  xpeople.age_group = mx.age_group 
            and xpeople.height_group = mx.height_group      
            and xpeople.gender = mx.gender
            and xpeople.rec_num <= mx.rec_count 
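One caveat about the mx matrix, offered as a check rather than a correction: because every cell is rounded up with ceil, the cell counts need not add up to rec_count. The sketch below reuses the same percentages and rec_count = 100 and simply sums the eight cells. It uses union all instead of union, which makes no difference for these distinct rows.

```sql
-- Sketch: same distributions and rec_count as the query above; only the final
-- select differs, summing the per-group counts instead of filtering people.
with
    params as (select 100 rec_count from dual),
    age_groups as (
        select 'group 1' age_group, .7 prc from dual union all
        select 'group 2' age_group, .3 prc from dual
    ),
    height_groups as (
        select 'group 1' height_group, .6 prc from dual union all
        select 'group 2' height_group, .4 prc from dual
    ),
    genders as (
        select 'male'   gender, .6 prc from dual union all
        select 'female' gender, .4 prc from dual
    ),
    mx as (
        select  age_group, height_group, gender,
                ceil(
                    age_groups.prc *
                    height_groups.prc *
                    genders.prc *
                    rec_count
                )  rec_count
        from    age_groups, height_groups, genders, params
    )
select  sum(rec_count) total_rows   -- sum of the eight rounded-up cells
from    mx;
```

Working through the arithmetic, the cells come to 26, 17, 17, 12, 11, 8, 8 and 5, so total_rows is 104 rather than 100. The main query can therefore return a few more rows than requested, or fewer if some group holds fewer people than its cell count. Replacing ceil with round would be one way to get closer to the target, at the cost of not guaranteeing an exact total either.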

Thanks for your help!