Simple random sampling while pulling data from a warehouse (Oracle engine) using PROC SQL in SAS

Time: 2017-08-04 13:35:16

Tags: sampling sas oracle

I need to pull a huge amount of data, say 600-700 variables from different tables in a data warehouse. In its raw form the dataset will easily reach 150 GB (79 million rows), but for my analysis I need only a million rows. How can I pull the data with PROC SQL directly from the warehouse while doing simple random sampling on the rows?

The code below won't work because ranuni is not supported by Oracle:

    proc sql outobs=1000000;
    /* ranuni is a SAS function; Oracle rejects it inside the pass-through query */
    select * from connection to oracle(
    select * from tbl1 order by ranuni(12345));
    quit;

How do you propose I do it?

1 answer:

Answer 0 (score: 0)

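As I understand it, you need a sample of about 700 rows. Let's put that in a macro variable (for those more familiar with languages other than SAS: it is something like a precompiler variable). If you need a different sample size, such as the one million rows mentioned in the question, change this value accordingly.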

    %let required_rows = 700;

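Now calculate what fraction of the data that is (and hope Oracle does not scan the whole table to count it). Again, I put the result in a macro variable. (For those familiar with other languages: yes, SAS has techniques for filling macro variables while code is executing. This is possible because SAS compiles step by step.)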

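    proc sql;
    /* Count the rows in Oracle, then store rows wanted / rows available
       in the macro variable required_fraction */
    select &required_rows / rows_tbl1 
    into :required_fraction
    from connection to oracle
    (   select count(*) as rows_tbl1
        from tbl1 );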

Finally, select approximately that many records:

    /* Keep each row when its random draw in [0, 1) falls below the fraction */
    select * from connection to oracle
    (   select *
        from tbl1 
        where DBMS_RANDOM.VALUE < &required_fraction );
    quit;
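To see why this returns only approximately the requested number of rows, here is a short worked derivation. It assumes Oracle evaluates DBMS_RANDOM.VALUE independently for every row, and it uses the symbols $n$ (the value of required_rows), $N$ (the row count of tbl1) and $K$ (the number of rows actually returned), which are introduced here for illustration only.

```latex
p = \frac{n}{N}, \qquad K \sim \mathrm{Binomial}(N, p)

\mathbb{E}[K] = N p = n, \qquad
\operatorname{sd}(K) = \sqrt{N p (1 - p)} = \sqrt{n (1 - p)} \le \sqrt{n}
```

With $n = 700$ the standard deviation is at most $\sqrt{700} \approx 26.5$, so the query typically returns a few dozen rows more or fewer than 700, and it returns fewer than 700 about half the time.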

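Remarks:

  • I have not tested this code.
  • If you need exactly 700 rows, you can increase the required fraction by a factor of two and then draw the final random sample in SAS.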
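The last remark could be sketched as follows. This is an untested sketch, not part of the original answer: it assumes the pass-through query above was run with the enlarged fraction and that its result was saved as a SAS dataset. The dataset names work.oversample and work.final_sample are hypothetical.

```sas
/* Assumption: work.oversample (hypothetical name) holds the rows returned
   by the pass-through query after the required fraction was doubled. */
%let required_rows = 700;

proc surveyselect data=work.oversample
                  out=work.final_sample    /* hypothetical output dataset */
                  method=srs               /* simple random sampling without replacement */
                  sampsize=&required_rows  /* exact number of rows to keep */
                  seed=12345;              /* fixed seed for reproducibility */
run;
```

PROC SURVEYSELECT stops with an error if work.oversample has fewer rows than sampsize, which is why the fraction is enlarged in the Oracle step first.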