Question

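I have a dataset (LRG_DS) with about 74,000,000 observations. The dataset is indexed by a variable (I_VAR1) that has about 7,500 unique values. I found this out by running proc contents on the dataset.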

I want to create a dataset (TEMP) that contains only the roughly 7,000 unique values of the indexed variable.

I have tried the following:

data TEMP; 
   set LRG_DS (keep = I_VAR1);  
   by I_VAR1;   
   if first.I_VAR1; 
   run;

and

proc sort data = LRG_DS nodupkey out = TEMP (keep = I_VAR1); 
   by I_VAR1;
   run;

The first method takes about 46 seconds and the second about 55 seconds.

I have read that the sas7bndx file is not meant to be inspected on its own; it exists to speed up certain procedures that use the indexed variable.

Any help is much appreciated!

Answer 1

YMMV, but starting with an empty hash table and populating it with the unique key values may work better than sorting.

Create some sample data:

data x;
  do cnt=1 to 10*100000;
    var=round(rand('uniform'),0.001);
    do cnt2=1 to 10;
      output;
    end;
    drop cnt2;
  end;
run;

Test the speed using proc sort:

proc sort data=x(keep=var) out=sorted nodupkey;
  by var;
run;

Compare it with the hash table version:

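data _null_;

   set x(keep=var) end=eof;

   /* Create the hash object once, on the first iteration */
   if _n_ eq 1 then do;
     declare hash ht ();
     rc = ht.DefineKey ('var');
     rc = ht.DefineDone ();
   end;
   /* check() returns 0 if the key is already stored, so add only new keys */
   if ht.check() ne 0 then do;
     rc = ht.add();
   end;
   /* After the last observation, write the unique keys to dataset ids */
   if eof then do;
     rc = ht.output(dataset:"ids");
   end;
run;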

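From my brief testing, I found that the hash table version starts to perform worse as the number of unique values increases. It may be possible to compensate for this by setting the hash size appropriately up front, but I have not tested that.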
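
For reference, here is an untested sketch of what setting the hash size up front could look like, using the hashexp argument of the declare statement. The value 12 (2**12 = 4096 buckets, versus the default of 8, i.e. 256 buckets) and the output dataset name ids_sized are assumptions picked for illustration, not tuned or timed values.

```sas
data _null_;
   set x(keep=var) end=eof;

   if _n_ eq 1 then do;
     /* hashexp: 12 is an assumed value; it requests 2**12 buckets */
     declare hash ht (hashexp: 12);
     rc = ht.DefineKey ('var');
     rc = ht.DefineDone ();
   end;

   /* check() returns 0 when the key is already in the table */
   if ht.check() ne 0 then rc = ht.add();

   /* ids_sized is a hypothetical output name */
   if eof then rc = ht.output(dataset: "ids_sized");
run;
```

For the dataset in the question, the same step should apply with LRG_DS in place of x and I_VAR1 in place of var. Whether a larger hashexp actually helps would need to be timed against the default on the real data.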
