Extracting data from a MATLAB matrix without a for loop

Date: 2013-06-12 16:03:33

Tags: matlab matrix vectorization

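In MATLAB, suppose I have a 10 x 100 matrix called M. I would like to extract particular entries of this matrix, chosen by row index, and operate on them right away in a vectorized way.

For example, for the first row I want to compute sum(M(1, 1:1:100)). For the second row I want sum(M(2, 1:2:100)). For the third row I want sum(M(3, 1:3:100)), and so on. For the tenth row I would of course have sum(M(10, 1:10:100)).

I have this in a for loop, but I would like to know whether there is a way to extract this data without a for loop. Thanks.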

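3 answers:

Answer 0 (score: 5)

Proposed solution

I finally cracked a truly vectorized solution that uses logical indexing to select the elements of the input matrix that are to be summed. The magic is done with bsxfun and its optional function handle @mod. The code is listed below: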

[m,n] = size(M);
%// All of the magic happens here: build a logical mask with 1s at the
%// entries of the input matrix that are to be summed and 0s elsewhere.
mask = bsxfun(@mod,1:n,(1:m)')==1;
mask(1,:) = 1; %// mod(c,1) is always 0, so set the first row to all ones explicitly
sumvals = sum(mask.*M,2); %// finally, get the sum values
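
As a sanity check, the mask-based result can be compared against a plain loop on a small matrix. The sketch below is not part of the original answer. The integer-valued test matrix is an assumption chosen so that the comparison is exact, and the zero-based variant mask0 is an alternative formulation (also an assumption) that avoids the special case for the first row.

```matlab
% Integer-valued 10 x 100 test matrix (assumed), so all sums are exact
M = reshape(1:1000,10,100);
[m,n] = size(M);

% Reference: the plain loop from the question
ref = zeros(m,1);
for r = 1:m
    ref(r) = sum(M(r,1:r:n));
end

% Mask as built in the answer above
mask = bsxfun(@mod,1:n,(1:m)')==1;
mask(1,:) = true;
sumvals = sum(mask.*M,2);

% Variant (not from the answer): use zero-based column offsets, so that
% mod(c-1,r)==0 selects columns 1, r+1, 2r+1, ... and row 1 needs no fix-up
mask0 = bsxfun(@mod,0:n-1,(1:m)')==0;
sumvals0 = sum(mask0.*M,2);

disp(isequal(sumvals,ref) && isequal(sumvals0,ref))
```

On R2016b or later, implicit expansion should allow the bsxfun call to be written as mod(1:n,(1:m)')==1.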

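Benchmarking

In this benchmarking section I cover four approaches: the one listed earlier in this post, its GPU-ported version, and the arrayfun-based and sparse-based approaches listed in the other solution.

Three sets of input data are used for the benchmarks: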

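  • Set 1: the number of columns is kept at 10 times the number of rows, the same ratio as the input matrix in the question.
  • Set 2: the number of rows is extended so that it is now 10 times the number of columns. This really tests the loop-based codes, since the number of iterations is higher in this case.
  • Set 3: this set extends Set 2 by increasing the number of rows further, and so is another important test of the truly vectorized approaches against the others.

The function codes used for benchmarking are listed next: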

function sumvals = sumrows_stepped_bsxfun(M)
% ... same as the code posted earlier
return

function sumvals = sumrows_stepped_bsxfun_gpu(M)
gM = gpuArray(M);
[m,n] = size(gM);
mask = bsxfun(@mod,gpuArray.colon(1,n),gpuArray.colon(1,m)')==1; %//'
sumvals = gather(sum(mask.*gM,2));
sumvals(1) = sum(M(1,:));
return

function S = sumrows_stepped_arrayfun(M)
[m,n] = size(M);
S = arrayfun(@(x) sum(M(x,1:x:n)), 1:m);
return

function B = sumrows_stepped_sparse(M)
sz = size(M);
A=sparse(sz(1),sz(2));
for n=1:sz(1),
    A(n, 1:n:end)=1;
end
B=full(sum(M.*A,2));
return

Please note that timeit was used to time the CPU-based codes and gputimeit to time the GPU-based codes.
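
For readers unfamiliar with timeit, here is a minimal sketch of how it might be called on two of the CPU approaches. The matrix size and the handle names f_mask and f_arrayfun are assumptions made for illustration. timeit takes a function handle with no input arguments and returns a typical run time in seconds. gputimeit is called the same way but requires Parallel Computing Toolbox, so it is left out here.

```matlab
M = rand(200,50);            % hypothetical test size
[m,n] = size(M);

% Mask built once, as in the bsxfun solution
mask = bsxfun(@mod,1:n,(1:m)')==1;
mask(1,:) = true;

% Zero-argument handles wrapping the work to be timed
f_mask     = @() sum(mask.*M,2);                       % excludes mask construction
f_arrayfun = @() arrayfun(@(x) sum(M(x,1:x:n)), 1:m);

t_mask     = timeit(f_mask);       % typical run time in seconds
t_arrayfun = timeit(f_arrayfun);
fprintf('mask: %.3g s, arrayfun: %.3g s\n', t_mask, t_arrayfun);
```

Note that f_mask here times only the multiply-and-sum step, whereas the benchmark functions in this answer also include building the mask.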

System configuration used for testing:

MATLAB Version: 8.3.0.532 (R2014a)
Operating System: Windows 7
RAM: 3GB
CPU Model: Intel® Pentium® Processor E5400 (2M Cache, 2.70 GHz)
GPU Model: GTX 750Ti 2GB

The benchmark results thus obtained:

[Three benchmark plots; images not available]

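Conclusions

  1. For data with fewer rows than columns, the number of iterations is small and the loop-based codes seem to have the upper hand.

  2. As the number of rows increases, the benefit of the truly vectorized approaches becomes clear. You will also notice that, for Set 3, the bsxfun on CPU approach does well against the non-vectorized approaches until around the 12000 x 300 mark. The reason is that bsxfun creates a huge logical mask, and at that point the memory bandwidth requirement becomes too high for bsxfun's compute capability to pay off. This makes sense, because vectorized operations by definition operate on many elements in one go, so memory bandwidth is essential. With a better machine and more RAM, that 12000 x 300 mark should extend further.

  3. If the number of rows could be extended further, the benefit of a vectorized solution would become even clearer, as long as the memory bandwidth is kept in check.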
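
To get a feel for the temporaries mentioned in conclusion 2, the sketch below builds the mask at the 12000 x 300 size and inspects its footprint with whos. It is an illustration, not a measurement from the original benchmark; the variable names info_mask, info_M, and frac_used are assumptions.

```matlab
m = 12000; n = 300;                 % size quoted in conclusion 2
M = rand(m,n);

mask = bsxfun(@mod,1:n,(1:m)')==1;  % the mod result is a full m x n double temporary
mask(1,:) = true;

info_mask = whos('mask');           % logical: 1 byte per element
info_M    = whos('M');              % double: 8 bytes per element
frac_used = nnz(mask)/numel(mask);  % share of entries that actually get summed

fprintf('mask: %.1f MB, M: %.1f MB, entries used: %.2f%%\n', ...
    info_mask.bytes/1e6, info_M.bytes/1e6, 100*frac_used);
```

Only a small fraction of the mask is nonzero when there are many more rows than columns, so most of the element-wise multiplications in mask.*M are by zero while still being read from and written to memory.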

Benchmarking code

If anyone would like to test it on their own system, here is the benchmarking code:

    clear all; clc; close all
    
    outputfile = 'results.xlsx';
    delete(outputfile); %// remove file, so that new results could be written into
    
    base_datasize_array = 40:60:400;
    methods = {'BSXFUN on GPU','BSXFUN on CPU','ARRAYFUN','SPARSE'};
    num_approaches = numel(methods);
    num_sets = 3;
    
    timeall_all = zeros(num_approaches,numel(base_datasize_array),num_sets);
    datasize_lbs = cell(numel(base_datasize_array),num_sets);
    for set_id = 1:num_sets
        switch set_id
            case 1
                N1_arr = base_datasize_array*2;
                N2_arr = N1_arr*10;
            case 2
                N2_arr = base_datasize_array*2;
                N1_arr = N2_arr*10;
            case 3
                N2_arr = base_datasize_array;
                N1_arr = N2_arr*40;
        end
    
        timeall = zeros(num_approaches,numel(N1_arr));
        for iter = 1:numel(N1_arr)
            M = rand(N1_arr(iter),N2_arr(iter));
    
            f = @() sumrows_stepped_bsxfun_gpu(M);
            timeall(1,iter) = gputimeit(f); clear f
    
            f = @() sumrows_stepped_bsxfun(M);
            timeall(2,iter) = timeit(f); clear f
    
            f = @() sumrows_stepped_arrayfun(M);
            timeall(3,iter) = timeit(f); clear f
    
            f = @() sumrows_stepped_sparse(M);
            timeall(4,iter) = timeit(f); clear f
    
        end
        timeall_all(:,:,set_id) = timeall;
    
        wp = repmat({' '},numel(N1_arr),1);
        datasize_lbs(:,set_id) = strcat(cellstr(num2str(N1_arr.')),' x ',...
            wp,cellstr(num2str(N2_arr.')));
    end
    
    for set_id=1:num_sets
        out_cellarr = cell(numel(methods)+1,numel(N1_arr)+1);
        out_cellarr(1,1) = {'Methods'};
        out_cellarr(2:end,1) = methods;
        out_cellarr(1,2:end) = datasize_lbs(:,set_id);
        out_cellarr(2:end,2:end) = cellfun(@(x) num2str(x),...
            num2cell(timeall_all(:,:,set_id)),'Uni',0);
        xlswrite(outputfile, out_cellarr,set_id);
    end
    

Answer 1 (score: 3)

You can try this one-liner:

S=arrayfun(@(n) sum(M(n,1:n:100)), 1:10)

Alternatively, you can create a sparse matrix beforehand

A=sparse(100,10);
for n=1:10, 
   A(1:n:100, n)=1; 
end

and find the sums with

S=diag(M*A);

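For large matrices this can be optimized further by defining A=sparse(10,100), filling it row-wise with A(n, 1:n:100)=1, and computing
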
S=sum(M.*A,2);
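
The two sparse formulations give the same sums but do different amounts of work. The sketch below (the test matrix and the names Acol, Arow, and P are assumptions for illustration) shows that M*Acol forms a full m-by-m product of which only the diagonal is kept, whereas M.*Arow touches each selected entry once.

```matlab
% Integer-valued test matrix (assumed), so the comparison is exact
M = reshape(1:1000,10,100);
[m,n] = size(M);

Acol = sparse(n,m);   % n x m selector: column r marks rows 1:r:n
Arow = sparse(m,n);   % m x n selector: row r marks columns 1:r:n
for r = 1:m
    Acol(1:r:n, r) = 1;
    Arow(r, 1:r:n) = 1;
end

P  = M*Acol;                    % m x m: every row of M against every selector
S1 = full(diag(P));             % only the m diagonal entries are needed
S2 = full(sum(M.*Arow,2));      % element-wise product, then row sums

disp(isequal(S1,S2))
```

Because P grows with the square of the number of rows, the diag(M*A) form can be expected to slow down as rows are added, while the element-wise form grows with the number of matrix entries.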

My quick benchmarking:

M=rand(10,100);
sz = size(M);
% 1) plain for loop
tic;
for k=1:10000,
    for n=1:sz(1),
        B(n)=sum(M(n,1:n:end));
    end
end
toc

% 2) arrayfun one-liner
tic;
for k=1:10000,
    B=arrayfun(@(n) sum(M(n,1:n:end)), 1:sz(1));
end
toc

% 3) sparse selector rebuilt on every iteration, then diag(M*A)
tic;
for k=1:10000,
    A=sparse(sz(2), sz(1));
    for n=1:sz(1),
        A(1:n:end, n)=1;
    end
    B=diag(M*A);
end
toc

% 4) sparse selector built once, then diag(M*A)
tic;
A=sparse(sz(2),sz(1));
for n=1:sz(1),
    A(1:n:end, n)=1;
end
for k=1:10000,
    B=diag(M*A);
end
toc

% 5) row-wise sparse selector built once, then sum(M.*A,2)
tic;
A=sparse(sz(1),sz(2));
for n=1:sz(1),
    A(n, 1:n:end)=1;
end
for k=1:10000,
    B=sum(M.*A,2);
end
toc

which returns

Elapsed time is 0.552470 seconds.
Elapsed time is 2.409102 seconds.
Elapsed time is 0.638072 seconds.
Elapsed time is 0.052246 seconds.
Elapsed time is 0.061893 seconds.

For a 30-by-1000 matrix:

Elapsed time is 1.785664 seconds.
Elapsed time is 3.954034 seconds.
Elapsed time is 4.760436 seconds.
Elapsed time is 0.926118 seconds.
Elapsed time is 0.865330 seconds.

And for a 1000-by-100 matrix:

Elapsed time is 51.389322 seconds.
Elapsed time is 63.443414 seconds.
Elapsed time is 68.327187 seconds.
Elapsed time is 29.056304 seconds.
Elapsed time is 1.147215 seconds.

Answer 2 (score: 1)

Since the sparse/matrix-multiplication approach shows an interesting performance effect, I'll post some additional results:

M  = rand(1000,100);
sz = size(M);

% PLAIN LOOP
tic
out1 = zeros(sz(1),1);
for k = 1:10000
    for n = 1:sz(1)
        out1(n) = sum(M(n,1:n:100));
    end
end
toc

% SPARSE MATRIXMULT
tic
A = sparse(sz(2),sz(1)); % all-zero sparse matrix; sparse(sz) would convert the size vector itself
for n = 1:sz(1)
    A(1:n:sz(2),n) = 1;
end
for k = 1:10000
    out2 = diag(M*A);
end
toc

isequal(out1,out2) % ok  

Plain loop:        11.441380 seconds.
Sparse/matrixmult: 27.503829 seconds.

As the matrix dimensions grow, the plain loop becomes more efficient.