Question

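Update: I have added a brief analysis of the three answers at the bottom of the question text and explained my choice.

My question: What is the most efficient way to construct a fixed-interval dataset from a random-interval dataset using stale data?

Some background: The above is a common problem in statistics. Frequently, one has a sequence of observations occurring at random times. Call it Input. But one wants a sequence of observations occurring, say, every 5 minutes. Call it Output. One of the most common ways to construct this dataset is to use stale data, i.e. set each observation in Output equal to the most recently occurring observation in Input.

So, here is some code to construct example datasets: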

TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];

OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];

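Both datasets start close to midnight at the turn of the millennium. However, the timestamps in Input occur at random intervals, whereas the timestamps in Output occur at fixed intervals. For simplicity, I have ensured that the first observation in Input always occurs before the first observation in Output. Feel free to make this assumption in any answers.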
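To make the target concrete, here is a tiny hand-checkable sketch of the stale-data fill. The names InSmall and OutSmall and all the numbers are made up for illustration; they are not part of the example datasets above.

```matlab
% Hypothetical toy data: irregular input times (col 1) and values (col 2)
InSmall  = [0.5 10; 1.7 20; 4.2 30];
% Regular output grid at times 1..5; the value column is still to be filled
OutSmall = [(1:5)', NaN(5, 1)];

for s = 1:size(OutSmall, 1)
    % Most recent input observation at or before this output time
    k = find(InSmall(:, 1) <= OutSmall(s, 1), 1, 'last');
    OutSmall(s, 2) = InSmall(k, 2);
end

disp(OutSmall)
```

The filled column comes out as 10, 20, 20, 20, 30: the value 20 observed at time 1.7 is reused at output times 2, 3 and 4 because no newer input arrives until time 4.2.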

Currently, I solve the problem like this:

sMax = size(Output, 1);
tMax = size(Input, 1);
s = 1;
t = 2;
%#Loop over input data
while t <= tMax
    if Input(t, 1) > Output(s, 1)
        %#If current obs in Input occurs after current obs in output then set current obs in output equal to previous obs in input
        Output(s, 2:end) = Input(t-1, 2:end);
        s = s + 1;
        %#Check if we've filled out all observations in output
        if s > sMax
            break
        end
        %#This step is necessary in case we need to use the same input observation twice in a row
        t = t - 1;
    end
    t = t + 1;
    if t > tMax
        %#If all remaining observations in output occur after last observation in input, then use last obs in input for all remaining obs in output 
        Output(s:end, 2:end) = Input(end, 2:end);
        break
    end
end

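Surely there is a more efficient, or at least more elegant, way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has a built-in function that I am not aware of? Any help would be much appreciated, as I use this routine a lot on some large datasets.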

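THE ANSWERS: Hi all, I have analysed the three answers, and as they stand, Angainor's is the best.

ChthonicDaemon's answer, while clearly the easiest to implement, is really slow. This is true even when the conversion to a timeseries object is done outside the speed test. I am guessing the resample function has a lot of overhead at the moment. I am running 2011b, so it is possible Mathworks have improved it in the intervening time. Also, this method needs an additional line for the case where Output ends more than one observation after Input.

Rody's answer runs only slightly slower than Angainor's (unsurprising, given they both employ the histc approach); however, it seems to have some problems. First, the method of assigning the last observation in Output is not robust to the last observation in Output occurring after the last observation in Input. This is an easy fix. But there is a second problem, which I think stems from passing InputTimeStamp as the first input to histc instead of OutputTimeStamp, as Angainor does. The problem emerges if you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)'; to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)'; when setting up the example inputs.

Angainor's answer appears robust to everything I threw at it, and it was the fastest.

I ran a lot of speed tests for different input specifications; the following numbers are fairly representative:

My naive loop: Elapsed time is 8.579535 seconds.

Angainor: Elapsed time is 0.661756 seconds.

Rody: Elapsed time is 0.913304 seconds.

ChthonicDaemon: Elapsed time is 22.916844 seconds.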
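For anyone wanting to reproduce this kind of comparison, a minimal sketch of a tic/toc timing run follows. The sizes are illustrative assumptions, not the ones behind the numbers above, and the timed loop is a simplified stand-in for a stale-data fill rather than any of the routines that were measured.

```matlab
% Illustrative sizes only; not the ones behind the timings quoted above
TInput  = 20000;
TOutput = 10000;
Input  = [730486 + cumsum(0.001 * rand(TInput, 1)), randn(TInput, 1)];
Output = [730486.002 + 0.001 * (0:TOutput - 1)', NaN(TOutput, 1)];

tStart = tic;                     % start the stopwatch
t = 1;
for s = 1:TOutput
    % Advance t while the next input observation is not after Output(s)
    while t < TInput && Input(t + 1, 1) <= Output(s, 1)
        t = t + 1;
    end
    Output(s, 2) = Input(t, 2);   % stale value for this output time
end
elapsed = toc(tStart);            % seconds since the matching tic
fprintf('Elapsed time is %f seconds.\n', elapsed);
```

Swapping the loop for another routine while keeping the same Input and Output gives timings that are at least comparable with each other on one machine.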

I am going with Angainor's solution and marking the question solved.

Answer 1

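This "stale data" approach is known as a zero order hold in the signal and time series fields. A quick search brings up many solutions. If you have Matlab 2012b, this is built into the timeseries class through the resample function, so you would simply do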

TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
InputTimeSeries = timeseries(InputData, InputTimeStamp);

OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001);
OutputTimeSeries = resample(InputTimeSeries, OutputTimeStamp, 'zoh'); % zoh stands for zero order hold
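If the timeseries class is not an option, a zero order hold can also be sketched with plain interp1. This is an alternative to the resample call above, not what it does internally, and it assumes your release's interp1 accepts the 'previous' method (older releases may not). The explicit fill for output times after the last input is added as a precaution, since query points outside the sample range may come back as NaN.

```matlab
TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';

% 'previous' holds the last sample at or before each query point
OutputData = interp1(InputTimeStamp, InputData, OutputTimeStamp, 'previous');

% Output times after the last input lie outside the sample range;
% hold the final input value there explicitly
late = OutputTimeStamp > InputTimeStamp(end);
OutputData(late) = InputData(end);
```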

Answer 2

Here is my take on the problem. histc is the way to go:

% find Output timestamps in Input bins
N   = histc(Output(:,1), Input(:,1));

% find counts in the non-empty bins
counts = N(find(N));

% find Input signal value associated with every bin
val = Input(find(N),2);

% now, replicate every entry in val
% as many times as specified in counts
index = zeros(1,sum(counts));
index(cumsum([1 counts(1:end-1)'])) = 1;
index = cumsum(index);
val_rep = val(index);

% finish the signal with last entry from Input, as needed
val_rep(end+1:size(Output,1)) = Input(end,2);

% done
Output(:,2) = val_rep;
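The replication step is the least obvious part, so here it is traced on toy values. The counts and val below are made up purely to follow the indexing; they do not come from the example datasets.

```matlab
% Hypothetical toy values, chosen only to trace the indexing trick
counts = [2; 1; 3];          % number of outputs falling in each non-empty bin
val    = [10; 20; 30];       % input value attached to each of those bins

index = zeros(1, sum(counts));
% Mark the position where each bin's run starts: positions 1, 3 and 4
index(cumsum([1 counts(1:end-1)'])) = 1;    % index is now [1 0 1 1 0 0]
% A running sum turns the start markers into bin numbers
index = cumsum(index);                      % index is now [1 1 2 3 3 3]
val_rep = val(index);                       % 10 10 20 30 30 30 (as a column)
```

Each start marker bumps the running sum by one, so every output position ends up holding the number of the bin it belongs to.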

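I checked the procedure for a few different input models (I changed the number of Output timestamps) and the results are the same. However, I am still not sure I have understood your problem, so let me know if something is wrong.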
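A check of this kind can be sketched as follows: fill Output once with a plain per-row lookup and once with the histc steps, then compare. The reference loop and the variable names ref and same are illustrative; the sketch relies on the question's assumption that the first Input observation precedes the first Output observation.

```matlab
TInput = 100;
TOutput = 50;
Input  = [730486 + cumsum(0.001 * rand(TInput, 1)), randn(TInput, 1)];
Output = [730486.002 + (0:0.001:TOutput * 0.001 - 0.001)', NaN(TOutput, 1)];

% Reference: look up the latest input at or before each output time
ref = NaN(TOutput, 1);
for s = 1:TOutput
    ref(s) = Input(find(Input(:, 1) <= Output(s, 1), 1, 'last'), 2);
end

% histc-based fill, following the same steps as the listing above
N = histc(Output(:, 1), Input(:, 1));
counts = N(N > 0);
val = Input(N > 0, 2);
index = zeros(1, sum(counts));
index(cumsum([1 counts(1:end-1)'])) = 1;
val_rep = val(cumsum(index));
val_rep(end+1:TOutput) = Input(end, 2);   % outputs past the last input

same = isequal(val_rep(:), ref);          % true when both fills agree
```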
