Question

I have a problem that I think needs only a small correction to work. I have a table (including the desired output column 'sum_usage'):

id  opt  t_purchase          t_spent             bonus  usage  sum_usage
a   1    10NOV2017:12:02:00  10NOV2017:14:05:00  100    9      15
a   1    10NOV2017:12:02:00  10NOV2017:15:07:33  100    0      15
a   1    10NOV2017:12:02:00  10NOV2017:13:24:50  100    6      6
b   1    10NOV2017:13:54:00  10NOV2017:14:02:58  100    3      10
a   1    10NOV2017:12:02:00  10NOV2017:20:22:07  100    12     27
b   1    10NOV2017:13:54:00  10NOV2017:13:57:12  100    7      7

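So I need to sum all usage values from t_purchase up to t_spent (each id, opt combination, i.e. group by id, opt, has only one unique t_purchase). I also have several million rows, so a hash table would be the best solution. I tried: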

data want;
 if _n_=1 then do;
  if 0 then set have(rename=(usage=_usage));
  declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
  h.definekey('id','opt', 't_purchase', 't_spent');
  h.definedata('_usage');
  h.definedone();
 end;
set have;
sum_usage=0;
do i=intck('second', t_purchase, t_spent) to t_spent ;
 if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
end;
drop _usage i;
run;

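The fifth line from the bottom, do i=intck('second', t_purchase, t_spent) to t_spent, is certainly wrong, but I don't know how to handle it. So the main question is how to set the time interval for this calculation. I already have a function built on this hash-table approach with the same keys but without the time interval, so writing this as a function would be nice too, but it is not necessary.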

Answer 1

Personally, I would ditch the hash and use SQL.

Sample data:

data have;

input id $ opt
    t_purchase  :datetime20.
    t_spent     :datetime20.
    bonus   usage sum_usage;

format 
    t_purchase  datetime20.
    t_spent     datetime20.;

datalines;
a    1  10NOV2017:12:02:00  10NOV2017:14:05:00   100     9        15
a    1  10NOV2017:12:02:00  10NOV2017:15:07:33   100     0        15
a    1  10NOV2017:12:02:00  10NOV2017:13:24:50   100     6        6
b    1  10NOV2017:13:54:00  10NOV2017:14:02:58   100     3        10
a    1  10NOV2017:12:02:00  10NOV2017:20:22:07   100    12        27
b    1  10NOV2017:13:54:00  10NOV2017:13:57:12   100     7       7
;

I am leaving your sum_usage column in for comparison.

Now create a table of sums. The new value is sum_usage2.

proc sql noprint;
create table sums as
select a.id,
       a.opt,
       a.t_purchase,
       a.t_spent,
       sum(b.usage) as sum_usage2
    from have as a,
         have as b
    where a.id = b.id
      and a.opt = b.opt
      and b.t_spent <= a.t_spent
      and b.t_spent >= a.t_purchase
    group by a.id, 
       a.opt,
       a.t_purchase,
       a.t_spent;
quit;

Now that you have the sums, join them back to the original table:

proc sql noprint;
create table want as
select a.*,
       b.sum_usage2
    from have as a
      left join
         sums as b
      on a.id = b.id
      and a.opt = b.opt
      and a.t_spent = b.t_spent
      and a.t_purchase = b.t_purchase;
quit;

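This produces the table you want. Alternatively, you can use a hash to look up the values and attach the sums in a data step (which may be faster given the size of the data).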

data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
    %create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;

rc = lk.find();

drop rc;
run;

The %create_hash() macro is available here: https://github.com/FinancialRiskGroup/SASPerformanceAnalytics
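
If you would rather not depend on the macro, the same lookup can be written with an explicit hash declaration. This is a minimal sketch, not the macro's actual implementation: it assumes the sums table built above, with one row per key combination, and the output name want3 is only illustrative.

```sas
data want3;
  if _n_ = 1 then do;
    /* never executed; only adds sum_usage2 to the program data vector */
    if 0 then set sums(keep=sum_usage2);
    declare hash lk(dataset: 'sums');
    lk.defineKey('id', 'opt', 't_purchase', 't_spent');
    lk.defineData('sum_usage2');
    lk.defineDone();
  end;

  set have;

  /* find() returns 0 on a match; otherwise clear the retained value */
  if lk.find() ne 0 then call missing(sum_usage2);
run;
```

The call missing matters because a variable brought in through SET is retained, so a failed lookup would otherwise carry over the previous row's sum.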

Answer 2

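I believe this question is a variation of an earlier one of yours, in which you computed a rolling sum by doing a hash lookup for every second of a 3-hour window for each record in the data set. Hopefully you realize that the simplicity of that approach costs a whopping 3*3600 hash lookups per record and requires loading the entire data vector into the hash.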

The time-logged data shown has new records inserted at the top, so I assume the data is monotonically descending in time.

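A DATA step can compute a rolling sum over a time window in a single pass over monotonic data. The technique uses "ring" arrays, in which the index advance is wrapped by a modulus. One array holds the times and another holds the metric (usage). The required array size is the largest number of items that can occur within the time window.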
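
To make the modulus wrap concrete, here is a tiny sketch of how a ring index advances. The names ring_size, idx and step are illustrative only and are not part of the code in this answer.

```sas
data _null_;
  ring_size = 4;            /* illustrative ring size */
  idx = 0;
  do step = 1 to 6;
    put step= idx=;         /* slot that would be written at this step */
    idx = mod(idx + 1, ring_size);  /* advance and wrap back to 0 after the last slot */
  end;
run;
```

With a ring of four slots the index walks 0, 1, 2, 3 and then returns to 0, so the oldest slot is the next one to be reused.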

Consider some generated sample data with time steps of 1 and 2 seconds and a single jump of 200 seconds:

%let have_count = 150; /* number of rows to generate; 150 matches the log shown further down */
data have;
  time = '12oct2017:11:22:32'dt;
  usage = 0;
  do _n_ = 1 to &have_count;
     time + 2; *ceil(25*ranuni(123));
     if _n_ > 30 then time + -1;
     if _n_ = 145 then time + 200;
     usage = floor(180*ranuni(123));
     delta = time-lag(time);
     output;
  end;
run;

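Start with the case of computing the rolling sum from prior items when the data is sorted by ascending time. (The descending case follows.)

The example parameters are a RING_SIZE of 16 and a TIME_WINDOW of 12 seconds.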

%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;

data want;
  array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
  array ring_time  [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);

  retain ring_tail 0 ring_head -1 span 0 span_usage 0;

  set have;
  by time ; * cause error if data not sorted per algorithm requirement;

  * unload from accumulated usage the tail items that fell out the window;
  do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
    span + -1;

    span_usage + -ring_usage(ring_tail);
    ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
  end;

  ring_head = mod ( ring_head + 1, &RING_SIZE );
  span + 1;

  if span > 1 and (ring_head = ring_tail) then do;
    _n_ = dim(ring_time);
    put 'ERROR: Ring array too small, size=' _n_;
    abort cancel;
  end;

  * update the ring array;
  ring_time(ring_head) = time;
  ring_usage(ring_head) = usage;

  span_usage + usage;

  drop ring_tail ring_head span;
run;

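For data sorted in descending order, you could juggle things around: sort ascending, compute the rolling sum, then re-sort descending.

What if that kind of juggling is not possible, or you want just a single pass?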

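The items that take part in the rolling calculation must then come from "lead" rows, that is, rows not yet read by the SET. How is that possible? A second SET statement can be used to open a separate channel to the data set and so obtain the lead values.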
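
As a simpler illustration of the second-SET idea than the full step in this answer, the sketch below attaches the next row's values to each row. It assumes the have data set generated above; lead_demo, last, next_time and next_usage are hypothetical names.

```sas
data lead_demo;
  set have end=last;                 /* current row */
  if not last then
    /* second channel to the same data, started one row ahead */
    set have(firstobs=2 keep=time usage
             rename=(time=next_time usage=next_usage));
  else call missing(next_time, next_usage);  /* no lead row after the last one */
run;
```

The second SET is skipped on the final row so that it never reads past the end of the data, which would stop the step before the last row is written.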

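Processing lead data requires more bookkeeping: premature overwrites and the shrinking window at the end of the data both have to be handled.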

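%let debug = *; /* assumption: an asterisk turns each debug line into a comment; set it to empty to enable the PUT tracing */
data want2;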
  array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
  array ring_time  [-1:%eval(&RING_SIZE-1)] _temporary_;

  retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;

  set have;

&debug put / _N_ ':' time= ring_head=;

  * unload ring_head slotted item from sum;
  span + -1;
  span_usage + -ring_usage(ring_head);

  * advance ring_head slot by 1, the vacated slot will be overwritten by lead;
  ring_head = mod ( ring_head + 1, &RING_SIZE ); 

&debug put +2 ring_time(ring_head)= span= 'head';

  * load ring with lead values via a second SET of the same data;
  if not end2 then do;

    do until (_n_ > 1 or lead_index = 0 or end2);
      set have(keep=time usage rename=(time=t usage=u)) end=end2;  * <--- the second SET ;

      if end2 then guard_index = lead_index;

&debug if end2 then put guard_index=;

      ring_time(lead_index) = t;
      ring_usage(lead_index) = u;

&debug put +2 ring_time(lead_index)=  'lead';

      lead_index = mod ( lead_index + 1, &RING_SIZE);
    end;
  end;

  * advance ring_tail to cover the time window;
  if ring_tail ne guard_index then do;

      ring_tail_was = ring_tail;
      ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;

      do while (time - ring_time(ring_tail) <= &TIME_WINDOW);

          span + 1;
          span_usage + ring_usage(ring_tail);

&debug put +2 ring_time(ring_tail)= span= 'seek';

          ring_tail_was = ring_tail;
          ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;

          if ring_tail_was = guard_index then leave;

          if span > 1 and (ring_head = ring_tail) then do;
            _n_ = dim(ring_time);
            put 'ERROR: Ring array too small, size=' _n_;
            abort cancel;
          end;
      end;

      * seek went beyond window, back tail off to prior index;
      ring_tail = ring_tail_was;

  end;

&debug put +2 ring_time(ring_tail)= span= 'mark';

  drop lead_index t u ring_: guard_index span;

  format ring: span: usage 6.;
run;
options source;

Confirm that both methods produce the same computed results:

proc sort data=want2; by time;
run;

proc compare noprint data=want compare=want2 out=diff outnoequal;
  id time;
  var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.

I have not benchmarked the ring-array approach against Proc EXPAND or the hash approach.

Caution: dead-reckoning a rolling value with +in and -out operations can accumulate rounding errors when the values are not integers.
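
A small sketch of the effect and one possible mitigation follows. It assumes the values carry at most two decimal places, so the 0.01 rounding unit is an assumption, and the variable names raw and rounded are illustrative.

```sas
data _null_;
  raw = 0;
  rounded = 0;

  /* two items enter the window */
  raw + 0.1;
  raw + 0.2;
  rounded = round(rounded + 0.1, 0.01);
  rounded = round(rounded + 0.2, 0.01);

  /* the same two items leave the window */
  raw + -0.1;
  raw + -0.2;
  rounded = round(rounded - 0.1, 0.01);
  rounded = round(rounded - 0.2, 0.01);

  /* raw may keep a tiny nonzero residue; rounded is snapped to the unit each time */
  put raw= best32. rounded= best32.;
run;
```

Rounding after every update keeps the running total on the chosen grid. Another option is to recompute the sum from the ring array periodically instead of relying only on incremental updates.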

Computing a rolling sum of a column within a time interval in SAS