Comparing two data sets in SAS

Time: 2014-07-24 14:48:04

Tags: sas

I have the following data sets:

data data_one;
length X 3
Y $ 20; 

input x y ;

datalines;
1 test
2 test
3 test1
4 test1
5 test
6 test
7 test1
;
run;

data data_two;
length Z 3
       A $ 20;

input Z A;

datalines;
1 test
2 test1
3 test2
;
run;

What I want is a data set that tells me how often each string in column A of data_two occurs in column Y of data_one. The result should look like this:

 Obs    test    test1    test2

  1       4       3        0

Thanks in advance!

1 answer:

Answer 0: (score: 1)

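  1. First, count the Y values that occur in data_one.
  2. Then build a de-duplicated list of the values that occur in data_two, sorted so that it can be merged in the next step.
  3. Merge the data_one Y counts from step 1 with the list from step 2. Values that are in data_two but not in data_one (b and not a) get count = 0, and Y values that do not occur in data_two are dropped (if b).
  4. The final step transposes the vertical list of counts into a set of horizontal variables.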

    proc freq data=data_one noprint;
        table y / out=count_one (keep=y count);
    run;
    proc sort data=data_two out=list_two (keep=a rename=(a=y)) nodupkey;
        by a;
    run;
    data count_all;
        merge count_one (in=a) list_two (in=b);
        by y;
        if (b and not a) then count=0;
        if b;
    run;
    proc transpose data=count_all out=final (drop=_name_ _label_);
        id y;
    run;
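
As a quick sanity check, the transposed data set can be printed and compared with the table in the question. This is only a sketch and assumes the steps above ran without errors and left a data set named final in the WORK library:

```sas
/* Print the one-row result. With the sample data from the question, */
/* the columns should be test, test1 and test2.                      */
proc print data=final;
run;
```

With the sample data this should show test = 4, test1 = 3 and test2 = 0. Note that PROC TRANSPOSE turns the values of the ID variable into variable names, so values that are not valid SAS names may be altered in the output.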
    

The first three steps can be replaced by a single PROC SQL step:

    proc sql;
        create table count_all as
        select distinct
                coalesce(t1.y,t2.a) as y,
                case
                    when missing(t1.y) then 0 
                    else count(t1.y)
                end as N
            from data_one as t1
            right join data_two as t2
                on t1.y=t2.a
            group by 1
            order by 1;
    quit;
    proc transpose data=count_all out=final (drop=_name_);
        id y;
    run;
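
A possibly simpler variant of the SQL step, offered as a sketch rather than as the answer's own method: since count() on a column only counts non-missing values, a left join starting from data_two already yields 0 for strings that never occur in data_one, so the case expression is not strictly needed. The inline view with select distinct is an added assumption that A might contain duplicates, which the sample data does not; without it, duplicate values of A would inflate the counts.

```sas
proc sql;
    create table count_all as
    select t2.a as y,
           /* count() ignores missing values, so unmatched rows give 0 */
           count(t1.y) as N
        /* de-duplicate A first so that the join cannot multiply rows */
        from (select distinct a from data_two) as t2
        left join data_one as t1
            on t1.y = t2.a
        group by t2.a
        order by t2.a;
quit;

/* Same transpose as above: one column per distinct value of y */
proc transpose data=count_all out=final (drop=_name_);
    id y;
run;
```

For the sample data this should produce the same single row as the other two approaches.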