Question

我有一个执行以下功能的for循环：

采用M×8矩阵并且：

将其拆分为大小为512个元素的块（表示矩阵的X乘以8 == 512，元素数量可以是128,256,512,1024,2048）
将块重新整形为1乘512（元素数）矩阵。
取最后1/4的矩阵并将其放在前面，
例如Data = [Data(1,385:512),Data(1,1:384)];

以下是我的代码：

for i = 1 : NumOfBlock  
    if i == 1  
        Header = tempHeader(1:RowNeeded,:);  
        Header = reshape(Header,1,BlockSize); %BS  
        Header = [Header(1,385:512),Header(1,1:384)]; %CP  
        Data = tempData(1:RowNeeded,:);  
        Data = reshape(Data,1,BlockSize); %BS  
        Data = [Data(1,385:512),Data(1,1:384)]; %CP  
        start = RowNeeded + 1;  
        end1 = RowNeeded * 2;  
    else  
        temp = tempData(start:end1,:);  
        temp = reshape(temp,1,BlockSize); %BS  
        temp = [temp(1,385:512),temp(1,1:384)]; %CP  
        Data = [Data, temp];  
    end

    if i <= 127 & i > 1
        temp = tempHeader(start:end1,:);
        temp = reshape(temp,1,BlockSize); %BS
        temp = [temp(1,385:512),temp(1,1:384)]; %CP
        Header = [Header, temp];
    end

    start = end1 + 1;
    end1=end1 + RowNeeded;  
end

使用500万个元素运行此循环将花费超过1小时。我需要它尽可能快（秒）。这个循环能够被矢量化吗？

Answer 1

根据您的功能描述，这是我想出的：

M = 320;           %# M must be divisble by (numberOfElements/8)
A = rand(M,8);     %# input matrix

num = 512;         %# numberOfElements
rows = num/8;      %# rows needed

%# equivalent to taking the last 1/4 and putting it in front
A = [A(:,7:8) A(:,1:6)];

%# break the matrix in blocks of size (x-by-8==512) into the third dimension
B = permute(reshape(A',[8 rows M/rows]),[2 1 3]);

%'# linearize everything
B = B(:);

这个图可能有助于理解上述内容：

Answer 2

矢量化可能会有所帮助，也可能没有帮助。有用的是了解哪里瓶颈。使用此处概述的分析器：

http://blogs.mathworks.com/videos/2006/10/19/profiler-to-find-code-bottlenecks/

Answer 3

如果你告诉你要做的事情会很好（我的猜测是动力系统中的一些模拟，但很难说）。

是的，当然它可以被矢量化：你的每个块实际上是四个子块;使用您的（极不标准）指数：

1 ... 128,129 ... 256,257 ...... 384,385 ... 512

矢量化的每个内核/线程/你曾经调用它应该执行以下操作：

i = threadIdx介于0和127之间 temp = data [1 + i] 数据[1 + i] =数据[385 + i] 数据[385 + i] =数据[257 + i] 数据[257 + i] =数据[129 + i] data [129 + i] = temp

你当然也应该在块上并行化，而不仅仅是矢量化。

Answer 4

我再次感谢Amro给我一个如何解决我的问题的想法。很抱歉没有在问题中说清楚。

以下是我解决问题的方法：

%#BS CDMA, Block size 128,512,1024,2048  
  BlockSize = 512;  
  RowNeeded = BlockSize / 8;  
  TotalRows = size(tempData);  
  TotalRows = TotalRows(1,1);  
  NumOfBlock = TotalRows / RowNeeded;  
  CPSize = BlockSize / 4;  

%#spilt into blocks  
  Header = reshape(tempHeader',[RowNeeded,8, 128]);  
  Data = reshape(tempData',[RowNeeded,8, NumOfBlock]);  
  clear tempData tempHeader;  

%#block spread & cyclic prefix  
    K = zeros([1,BlockSize,128],'single');  
    L = zeros([1,BlockSize,NumOfBlock],'single');  
       for i = 1:NumOfBlock  
           if i <= 128  
              K(:,:,i) = reshape(Header(:,:,i),[1,BlockSize]);  
              K(:,:,i) = [K((CPSize*3)+1:BlockSize),K(1:CPSize*3)];
           end  
           L(:,:,i) = reshape(Data(:,:,i),[1,BlockSize]);  
           L(:,:,i) = [L((CPSize*3)+1:BlockSize),L(1:CPSize*3)];
        end

可以对以下循环进行矢量化吗？

4 个答案: