Assign missing variable values based on a distribution in SAS

Time: 2016-07-11 19:13:55

Tags: sas

I would like to assign IDs with blank Sizes a size based on the frequency distribution of their Group.

Dataset A contains a snapshot of my data:

ID  Group   Size
1   A       Large
2   B       Small
3   C       Small
5   D       Medium
6   C       Large
7   B       Medium
8   B       -

Dataset B shows the frequency distribution of the Sizes among the Groups:

Group   Small   Medium  Large
A       0.31    0.25    0.44
B       0.43    0.22    0.35
C       0.10    0.13    0.78
D       0.29    0.27    0.44

For ID 8, we know that it has a 43% probability of being "small", a 22% probability of being "medium" and a 35% probability of being "large". That's because these are the Size distributions for Group B.

How do I assign ID 8 (and other blank IDs) a Size based on the Group distributions in Dataset B? I'm using SAS 9.4. Macros, SQL, anything is welcome!

2 answers:

Answer 0 (score: 0)

The 'Table' distribution of the RAND function is ideal for this. The last data step below shows how to use it; the steps before it generate random data and build the frequency table, so you can skip them if you already have that table.

See Rick Wicklin's blog about simulating multinomial data for an example of this in other use cases (and more information about the function).
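
As a minimal sketch of the function on its own, the step below draws 1,000 values with rand('Table') using Group B's probabilities from the question (0.43, 0.22, 0.35). The dataset name demo, the variable names draw and idx, and the seed are illustrative assumptions, not part of the original data. The function returns the index of the selected probability, which is then mapped to a label.

```sas
data demo;
  call streaminit(123);                       /* arbitrary seed, for reproducibility */
  array labels[3] $6 _temporary_ ('Small' 'Medium' 'Large');
  do draw = 1 to 1000;
    idx  = rand('Table', 0.43, 0.22, 0.35);   /* index of the selected probability: 1, 2 or 3 */
    size = labels[idx];                       /* map the index to its size label */
    output;
  end;
run;

proc freq data=demo;
  tables size;   /* shares should come out close to the probabilities passed in */
run;
```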

*Setting this up to help generate random data;
proc format;
  value sizef
  low - 1.3 = 'Small'
  1.3 <-<2.3  = 'Medium'
  2.3  - high = 'Large'
;
run;

*Generating random data;
data have;
  call streaminit(7);
  do id = 1 to 1e5;
    group = byte(65+rand('Uniform')*4);   *A = 65, B = 66, etc.;
    size  = put((rank(group)-66)*0.5 + rand('Uniform')*3,sizef.);  *Intentionally making size somewhat linked to group to allow for differences in the frequency;
    if rand('Uniform') < 0.05 then call missing(size); *A separate call to set missingness;
    output;
  end;
run;

proc sort data=have;
  by group;
run;

title "Initial frequency of size by group";
proc freq data=have;
  by group;
  tables size/list out=freq_size;
run;
title;

*Transpose to one row per group, needed for table distribution;
proc transpose data=freq_size out=table_size prefix=pct_;
  var percent;
  id size;
  by group;
run;


data want;
  merge have table_size;
  by group;
  array pcts pct_:;  *convenience array;

  if first.group then do _i = 1 to dim(pcts);  *must divide by 100 but only once!;
    pcts[_i] = pcts[_i]/100;
  end;

  if missing(size) then do;
    size_new = rand('table',of pcts[*]);   *table uses the pcts[] array to tell SAS the table of probabilities;
    size = scan(vname(pcts[size_new]),2,'_');
  end;
run;



title "Final frequency of size by group";
proc freq data=want;
  by group;
  tables size/list;
run;
title;
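
If the probabilities already exist as proportions, as in Dataset B from the question, neither the setup steps nor the division by 100 are needed. The following is a sketch under several assumptions: the two datasets are named a and b (hypothetical names), both contain the variable group, b holds the numeric columns small, medium and large, size in a is long enough to hold 'Medium', and a blank size is a true missing value rather than the "-" shown in the listing.

```sas
proc sort data=a; by group; run;
proc sort data=b; by group; run;

data want;
  if _n_ = 1 then call streaminit(7);        /* arbitrary seed, set once */
  merge a(in=in_a) b;
  by group;
  if in_a;                                   /* keep only rows that come from a */

  array probs[3] small medium large;         /* probabilities merged in from b */
  array labels[3] $6 _temporary_ ('Small' 'Medium' 'Large');

  if missing(size) and nmiss(of probs[*]) = 0 then do;
    /* As I understand it, rand('Table') can return dim+1 when the         */
    /* probabilities sum to less than 1 (e.g. after rounding), so the      */
    /* index is capped at the last category.                               */
    _idx = min(rand('Table', of probs[*]), dim(probs));
    size = labels[_idx];
  end;

  drop small medium large _idx;
run;
```

Rows whose group has no match in b keep their missing size, because the draw is skipped when any probability is missing.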

Answer 1 (score: 0)

You can also do this with a random value and some if-else logic:

proc sql;
    create table temp_assigned as select
        a.*, rand("Uniform") as random_roll, /*generate a random number from 0 to 1*/
        case when missing(size) then
            case when calculated random_roll < small then small
                 when calculated random_roll < sum(small, medium) then medium
                 else large /*catch-all, so a roll above a rounded total below 1 still gets a size*/
            end end as value_selected, /*probability of the selected size; missing when size is already known*/
        coalesce(case when calculated value_selected = small then "Small"
                      when calculated value_selected = medium then "Medium"
                      when calculated value_selected = large then "Large" end, size) as group_assigned /*map that probability back to its size label*/
        from temp as a
        left join freqs as b
        on a.group = b.group;
quit;

You can obviously do this without creating the value_selected variable, but I included it to make the logic easier to follow.
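
A sketch of that shorter form, assuming the same hypothetical table names (temp and freqs) and column names as the query above. It assigns the label directly from the cumulative probabilities, which also avoids matching probabilities by equality; that matching becomes ambiguous when two sizes in a group share the same probability.

```sas
proc sql;
    create table temp_assigned as select
        a.*,
        rand("Uniform") as random_roll, /* random number between 0 and 1 */
        case
            when not missing(a.size) then a.size /* keep sizes that are already known */
            when calculated random_roll < b.small then "Small"
            when calculated random_roll < sum(b.small, b.medium) then "Medium"
            when not missing(b.large) then "Large" /* everything above the Small + Medium cutoff */
        end as group_assigned /* stays missing only when the group has no row in freqs */
    from temp as a
    left join freqs as b
    on a.group = b.group;
quit;
```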
