I need to pull a huge amount of data, say 600-700 variables, from different tables in a data warehouse. In its raw form the dataset will easily reach 150 GB (79 million rows), but for my analysis I need only a million rows. How can I pull the data with proc sql directly from the warehouse, taking a simple random sample of the rows?
The code below won't work because ranuni is not supported by Oracle:
proc sql outobs =1000000;
select * from connection to oracle(
select * from tbl1 order by ranuni(12345));
quit;
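How do you propose I do it?
Answer 0 (score: 0)
As I understand it, you need a sample of about 700 rows. Let's put that in a macro variable (for those more familiar with other languages than with SAS, this is like a precompiler variable).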
%let required_rows = 700;
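Now calculate what fraction of the data that is (and hope Oracle does not scan the whole table to do it). Again, I put the result in a macro variable. (For those familiar with other languages: yes, SAS can populate macro variables while the code is executing. This is possible because SAS compiles and runs step by step.)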
proc sql;
select &required_rows / rows_tbl1
into :required_fraction
from connection to oracle
( select count(*) as rows_tbl1
from tbl1 );
Finally, select roughly that many records:
select * from connection to oracle
( select *
from tbl1
where DBMS_RANDOM.VALUE < &required_fraction );
quit;
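The filter on DBMS_RANDOM.VALUE keeps each row independently with probability equal to the fraction, so the sample size is only approximately the requested number. A rough sketch of how far it can drift, assuming each row gets an independent uniform draw on [0, 1); the symbols n (requested rows), N (rows in tbl1), p and K are introduced here for illustration, and the figures are the ones quoted in the question (79 million rows, a target of one million):

```latex
% Assumption: every row's DBMS_RANDOM.VALUE is an independent Uniform[0,1) draw.
% n = requested rows, N = rows in tbl1, p = n/N, K = rows actually returned.
\begin{align*}
  p &= \frac{n}{N}, \qquad K \sim \mathrm{Binomial}(N, p), \\
  \mathbb{E}[K] &= N p = n, \\
  \mathrm{SD}(K) &= \sqrt{N p (1 - p)} = \sqrt{n \left(1 - \frac{n}{N}\right)}. \\
  \intertext{With $N = 79 \times 10^{6}$ and $n = 10^{6}$:}
  p &\approx 0.01266, \qquad \mathrm{SD}(K) \approx \sqrt{10^{6} \times 0.9873} \approx 994.
\end{align*}
```

So under that assumption the result would typically land within a few thousand rows of one million. If an exact count matters, one option is to request slightly more rows than needed and trim the surplus on the SAS side.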
Notes: