Question

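Sorry for the vague title.

My dataset basically looks like this:

What I want is to find the maximum value of x for each ID. In this dataset, that would be 2 for ID = 18 and 3 for ID = 361.

Any feedback is greatly appreciated.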

Answer 1

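Proc Means with a class statement (so you don't have to sort) and a request for the max statistic is probably the most straightforward approach (untested):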

data sample; 
    input id x; 
datalines; 
18  1 
18  1 
18  2 
18  1 
18  2 
369 2 
369 3 
369 3 
361 1 
; 
run; 


proc means data=sample noprint max nway missing; 
   class id;
   var x;
   output out=sample_max (drop=_type_ _freq_) max=;
run;

For details on Proc Means, see the online SAS documentation (http://support.sas.com/onlinedoc/913/docMainpage.jsp).

Answer 2

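I don't quite understand your example. I can't imagine that the input dataset really has all the values in a single observation. Did you mean something like this instead?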

data sample;
    input myid myvalue;
datalines;
18  1
18  1
18  2
18  1
18  2
369 2
369 3
369 3
361 1
;

proc sort data=sample;
    by myid myvalue;
run;

data result;
    set sample;
    by myid;

    if last.myid then output;
run;

proc print data=result;
run;

This gives you the following result:

Obs    myid    myvalue

 1       18       2   
 2      361       1   
 3      369       3

Answer 3

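If you want to keep all records along with the maximum value of X, I would use the PROC MEANS approach followed by a merge statement. Alternatively, you can first sort the data by ID and DESCENDING X, and then use a RETAIN statement to create max_value directly in the data step: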

PROC SORT DATA=A; BY ID DESCENDING X; RUN;

DATA B; SET A;
BY ID;
RETAIN  X_MAX;
IF FIRST.ID  THEN   X_MAX = X;
ELSE                X_MAX = X_MAX;
RUN;
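
The first option mentioned above (PROC MEANS followed by a merge) is not shown in the answer. A minimal sketch of what it might look like, assuming the same input dataset A with variables ID and X; the names A_MAX and A_SORTED are made up for illustration:

```sas
/* Hypothetical names: A_MAX and A_SORTED are not from the original answer. */

/* One row per ID holding the group maximum. */
proc means data=A noprint nway;
    class ID;
    var X;
    output out=A_MAX (drop=_type_ _freq_) max=X_MAX;
run;

/* MERGE with BY needs both inputs ordered by ID. */
proc sort data=A out=A_SORTED;
    by ID;
run;

/* Attach X_MAX to every original record. */
data B;
    merge A_SORTED A_MAX;
    by ID;
run;
```

With NWAY and a CLASS statement, the output dataset comes back ordered by ID, so only the detail data needs the explicit sort.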

Answer 4

You can try this:

PROC SQL;
CREATE TABLE CHCK AS SELECT MYID, MAX(MYVALUE) AS MAX_VALUE FROM SAMPLE
GROUP BY MYID;
QUIT;
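
If every record should be kept alongside its group maximum, PROC SQL can do that as well, because it remerges summary statistics when a non-grouped column appears in the SELECT list. A small sketch, assuming the SAMPLE dataset with MYID and MYVALUE from Answer 2; the table name CHCK_ALL and the column name MAX_VALUE are made up for illustration:

```sas
/* Hypothetical names: CHCK_ALL and MAX_VALUE are not from the original answer. */
proc sql;
    create table chck_all as
    select myid,
           myvalue,
           max(myvalue) as max_value   /* group maximum repeated on each row */
    from sample
    group by myid;
quit;
```

SAS writes a note to the log saying the query requires remerging summary statistics back with the original data; that is expected here.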

Answer 5

For anyone who needs to do this with very large datasets, where performance is more of a concern, a few over-engineered options may be of interest:

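If your dataset is already sorted by ID, but not by X within each ID, you can still do this in a single data step without any sorting, using a retained maximum within each by group. Alternatively, you can use proc means (as per the top answer) but with a by statement instead of a class statement - this reduces memory usage.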

data sample; 
    input id x; 
datalines; 
18  1 
18  1 
18  2 
18  1 
18  2 
369 2 
369 3 
369 3 
361 1 
; 
run; 

data want;
  do until(last.ID);
    set sample;
    by ID notsorted;   /* sample is grouped by ID but not in ascending order (369 precedes 361) */
    xmax = max(x, xmax);
  end;
  x = xmax;
  drop xmax;
run;

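Even if your dataset is not sorted by ID, you can still do this in one data step without sorting it, by using a hash object to keep track of the maximum x value found for each ID as you go along. This will be a bit faster than proc means and will typically use less memory, because proc means does various calculations in the background that are not needed in the output dataset.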

data _null_;
  set sample end = eof;
  if _n_ = 1 then do;
    call missing(xmax);
    declare hash h(ordered:'a');
    rc = h.definekey('ID');
    rc = h.definedata('ID','xmax');
    rc = h.definedone();
  end;
  rc = h.find();
  if rc = 0 then do;
    if x > xmax then do;
        xmax = x;
        rc = h.replace();
    end;
  end;
  else do;
    xmax = x;
    rc = h.add();
  end;
  if eof then rc = h.output(dataset:'want2');
run;

In this example, on my PC, the hash approach used this much memory:

   memory              966.15k
   OS Memory           27292.00k

versus this much for an equivalent proc summary:

   memory              8706.90k
   OS Memory           35760.00k

Not a bad improvement if you really need it to scale up!

Answer 6

Use an appropriate by statement with the proc. For example,

data sample;
    input myid myvalue;
datalines;
18  1
18  1
18  2
18  1
18  2
369 2
369 3
369 3
361 1
;
run;

proc sort data=sample;
  by myid;
run;

proc means data=sample;
   var myvalue;
   by myid;
run;

Answer 7

I would just sort by x and id, putting the highest value for each ID at the top. NODUPKEY then removes every duplicate below it.

proc sort data=yourstacked_data out=yourstacked_data_sorted;
by DESCENDING x id;
run;

proc sort data=yourstacked_data_sorted NODUPKEY out=top_value_only;
by id;
run;

Answer 8

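If you want the result to show every id at the max value, you should use a multidata hash. That covers the cases where more than one id is found to have the max value.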

Example code:

Find the id(s) associated with the max value of 40 different numeric variables. The code is a Proc DS2 data program.

data have;
  call streaminit(123);

  do id = 1 to 1e5;                                  %* 100,000 rows;
    array v v1-v40;                                  %* 40 different variables;
    do over v; v=ceil(rand('uniform', 2e5)); end;
    output;
  end;
run;

proc ds2;
  data _null_;
    declare char(32) _name_ ;        %* global declarations;
    declare double value id;
    declare package hash result();

    vararray double v[*] v:;         %* variable based array, limit yourself to 1,000;
    declare double max[1000];        %* temporary array for holding the vars maximum values;

    method init();
      declare package sqlstmt s('drop table want');  %* DS2 version of `delete`;
      s.execute();

      result.keys([_name_]);                         %* instantiate a multidata hash;
      result.data([_name_ value id]);
      result.multidata();
      result.ordered('ascending');
      result.defineDone();
    end;

    method term();
      result.output('want');                         %* write the results to a table;
    end;

    method run();
      declare int index;
      set have;

      %* process each variable being examined for 'id at max';

      do index = 1 to dim(v);
        if v[index] > max[index] then do;         %* new maximum for this variable ?;
          _name_ = vname(v[index]);               %* retrieve variable name;
          value = v[index];                       %* move value into hash host variable;
          if not missing (max[index]) then do;
            result.removeall();                   %* remove existing multidata items associated with the variable;
          end;
          result.add();                           %* add new multidata item to hash;
          max[index] = v[index];                  %* track new maximum;
        end;
        else 
        if v[index] = max[index] then do;         %* two or more ids have same max;
          _name_ = vname(v[index]);
          value = v[index];
          result.add();                           %* add id to the multidata item;
        end;
      end;
    end;
  enddata;
run;
quit;

%let syslast=want;

Reminder: the default for Proc DS2 is not to overwrite existing tables. To "overwrite" a table, you need to do one of the following:

Use the table option overwrite=yes where the syntax allows it
- the package hash .output() method does not recognize table options
Drop the table before recreating it

The code above can be used in a Base SAS DATA step with only minor modifications.
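
As a rough idea of what such a DATA step version might look like, here is an untested sketch. It assumes the dataset have (id, v1-v40) created above and no missing values in v1-v40; the array name vmax and the use of the hash REMOVE method (which removes all items stored under a key in a multidata hash) are my own choices, not part of the original answer:

```sas
/* Sketch only: a DATA step take on the DS2 program above. */
data _null_;
  length _name_ $32 value 8;
  if _n_ = 1 then do;
    /* multidata:'y' lets one key (_name_) hold several ids tied at the max */
    declare hash result(multidata:'y', ordered:'a');
    rc = result.defineKey('_name_');
    rc = result.defineData('_name_', 'value', 'id');
    rc = result.defineDone();
  end;

  set have end=eof;
  array v[*] v1-v40;
  array vmax[40] _temporary_;      /* running maxima, kept across rows */

  do index = 1 to dim(v);
    if v[index] > vmax[index] then do;       /* new maximum (missing sorts lowest) */
      _name_ = vname(v[index]);
      value = v[index];
      /* drop the ids stored for the previous maximum of this variable */
      if not missing(vmax[index]) then rc = result.remove();
      rc = result.add();
      vmax[index] = v[index];
    end;
    else if v[index] = vmax[index] then do;  /* tie: another id at the current maximum */
      _name_ = vname(v[index]);
      value = v[index];
      rc = result.add();
    end;
  end;

  if eof then rc = result.output(dataset:'want');
run;
```

Unlike the DS2 program, the DATA step hash OUTPUT method replaces an existing want dataset, so no separate drop step is needed in this sketch.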

How to find the max value of a variable for each unique observation in a "stacked" dataset
