Correlation matrix ignoring NaN

Time: 2015-04-13 21:00:17

Tags: matlab correlation

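I am using MATLAB and I have a 60x882 matrix for which I need to compute the pairwise correlations between columns. However, I want to ignore every column that contains one or more NaNs (i.e. the result for any pair of columns in which at least one entry is NaN should be NaN).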

This is my code so far:

for i=1:size(auxret,2)
    for j=1:size(auxret,2)
        rho(i,j)=corr(auxret(:,i),auxret(:,j));
    end
end

But this is extremely inefficient. I considered using this call:

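corr(auxret, 'rows','pairwise'); but it does not produce the same result (it ignores the NaNs but still computes the correlation, so unless all entries of a column are NaN, it still gives an output).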

Any suggestions on how to improve the efficiency?

3 answers:

Answer 0: (score: 2)

To get the same output as your code using corr(auxret, 'rows','pairwise'), first set every column that contains a NaN entirely to NaN:

auxret(:,any(isnan(auxret))) = NaN;
r = corr(auxret, 'rows','pairwise');
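As a quick sanity check, here is a minimal sketch on a small made-up matrix (the values are illustrative, not the asker's data). It assumes the Statistics Toolbox is available for corr. After the NaN-containing column is blanked, its row and column in the result should be all NaN, while the remaining entries are the ordinary correlations.

```matlab
% Illustrative data (assumed, not from the question): column 3 contains one NaN
auxret = [8  2   3
          3  5   NaN
          7  10  3
          7  4   6
          2  6   7];

% Work on a copy and blank every column that contains at least one NaN
masked = auxret;
masked(:, any(isnan(masked), 1)) = NaN;

% Pairwise correlation on the masked data (corr needs the Statistics Toolbox)
r = corr(masked, 'rows', 'pairwise');

% Row and column 3 should be all NaN; r(1,2) is the usual correlation
disp(r)
```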

Answer 1: (score: 0)

This would be one efficient approach, especially when dealing with input data that contains NaNs:

%// Get mask of invalid columns and thus extract columns without any NaN
mask = any(isnan(auxret),1);
A = auxret(:,~mask);

%// Use correlation formula to get correlation outputs for valid columns
n = size(A,1);
sum_cols = sum(A,1);
sumsq_sqcolsum = n*sum(A.^2,1) - sum_cols.^2;

val1 = n.*(A.'*A) - bsxfun(@times,sum_cols.',sum_cols);      %//'
val2 = sqrt(bsxfun(@times,sumsq_sqcolsum.',sumsq_sqcolsum)); %//'
valid_outvals = val1./val2;

%// Setup output array and store the valid outputs in it
ncols = size(auxret,2);
valid_idx = find(~mask);
out = nan(ncols);
out(valid_idx,valid_idx) = valid_outvals;

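Basically, as a pre-processing step, it removes every column that has one or more NaNs and computes the correlation outputs for the remaining columns. We then initialize an output array of the appropriate size filled with NaNs and put the valid outputs back into it at the appropriate places.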
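The code above relies on the computational form of Pearson's r, r_xy = (n*sum(x.*y) - sum(x)*sum(y)) / sqrt((n*sum(x.^2) - sum(x)^2) * (n*sum(y.^2) - sum(y)^2)), evaluated for all column pairs at once. The following is a compact sketch of the same steps on a small made-up matrix (illustrative values only; the names s, d, R and max_err are not from the answer). It compares the result against corrcoef, which is part of base MATLAB, so no toolbox is assumed. The outer products are written with plain matrix multiplication instead of bsxfun.

```matlab
% Illustrative data (assumed): column 3 contains one NaN
auxret = [8  2   3
          3  5   NaN
          7  10  3
          7  4   6
          2  6   7];

mask = any(isnan(auxret), 1);   % true for columns with at least one NaN
A = auxret(:, ~mask);           % keep only NaN-free columns
n = size(A, 1);

s = sum(A, 1);                  % column sums (row vector)
d = n*sum(A.^2, 1) - s.^2;      % n*sum(x^2) - (sum x)^2 for each column

% Numerator and denominator of Pearson's r for every pair of valid columns
R = (n*(A.'*A) - s.'*s) ./ sqrt(d.'*d);

% Scatter the valid results back; entries for masked columns stay NaN
out = nan(size(auxret, 2));
out(~mask, ~mask) = R;

% Largest deviation from base MATLAB's corrcoef on the NaN-free columns
max_err = max(max(abs(R - corrcoef(A))));
disp(out)
```

On data with a very large mean relative to its spread, this one-pass formula can lose precision to cancellation; centering the columns first is a common remedy.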


Benchmarking

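Both the loopy approach and the alternative corr(auxret, 'rows','pairwise') appear to give valid results. There is a big catch, though: even a single NaN in any of the columns slows things down considerably. The slowdown is huge for the original loopy approach and still large with the rows + pairwise option, as the benchmark results that follow show.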

Benchmarking code

nrows = 60;
ncols = 882;
percent_nans = 1; %// decides the percentage of NaNs in input

auxret = rand(nrows,ncols);
auxret(randperm(numel(auxret),round((percent_nans/100)*numel(auxret))))=nan;

disp('------------------------------- With Proposed Approach')
tic
%// Solution code from earlier
toc

disp('------------------------------- With ROWS + PAIRWISE Approach')
tic
auxret(:,any(isnan(auxret))) = NaN;
out1 = corr(auxret, 'rows','pairwise');
toc

disp('------------------------------- With Original Loopy Approach')
tic
out2 = zeros(size(auxret,2));
for i=1:size(auxret,2)
    for j=1:size(auxret,2)
        out2(i,j)=corr(auxret(:,i),auxret(:,j));
    end
end
toc
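A single tic/toc pair can be noisy for short runs. As an alternative sketch, timeit (base MATLAB, R2013b and later) calls a function handle several times and returns a typical run time. The data setup mirrors the benchmark above; the names f_pairwise and t_pairwise are hypothetical, and corr again assumes the Statistics Toolbox.

```matlab
% Same synthetic input as in the benchmark above
nrows = 60;
ncols = 882;
percent_nans = 1;
auxret = rand(nrows, ncols);
auxret(randperm(numel(auxret), round((percent_nans/100)*numel(auxret)))) = nan;

% Blank the NaN-containing columns once, outside the timed call
auxret(:, any(isnan(auxret), 1)) = NaN;

% Time only the correlation call; timeit repeats it internally
f_pairwise = @() corr(auxret, 'rows', 'pairwise');
t_pairwise = timeit(f_pairwise);
fprintf('rows + pairwise: %.6f seconds\n', t_pairwise);
```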

Several scenarios are possible depending on the input size and the percentage of NaNs. The runtimes for each are:

Case 1: input is 6 x 88, percentage of NaNs is 10

------------------------------- With Proposed Approach
Elapsed time is 0.006371 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.052563 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.875620 seconds.

Case 2: input is 6 x 88, percentage of NaNs is 1

------------------------------- With Proposed Approach
Elapsed time is 0.006303 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.049194 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.871369 seconds.

Case 3: input is 6 x 88, percentage of NaNs is 0.001

------------------------------- With Proposed Approach
Elapsed time is 0.006738 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.025754 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.867647 seconds.

Case 4: input is 60 x 882, percentage of NaNs is 10

------------------------------- With Proposed Approach
Elapsed time is 0.007766 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.479645 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...

Case 5: input is 60 x 882, percentage of NaNs is 1

------------------------------- With Proposed Approach
Elapsed time is 0.014144 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.324878 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...

Case 6: input is 60 x 882, percentage of NaNs is 0.001

------------------------------- With Proposed Approach
Elapsed time is 0.020410 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 1.830632 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...

Answer 2: (score: 0)

What you describe is the default behavior of corr, without any special options. For example,

auxret =  [8     2     3
           3     5     NaN
           7    10     3
           7     4     6
           2     6     7];

rho = corr(auxret)

results in

rho =

    1.0000   -0.1497       NaN
   -0.1497    1.0000       NaN
       NaN       NaN       NaN