Question

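I have been working on image transformations lately and ran into the following situation: I have a large array (shape 100,000 x 3) in which each row represents a point in 3D space, for example: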

pnt = [x y z]

All I want to do is iterate over the points and multiply each one by a matrix called T (shape = 3 x 3).

Test with NumPy:

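import numpy as np

def transform(pnt_cloud, T):
    # Output buffer, one value per point (assumed to start out as zeros)
    arr = np.zeros(pnt_cloud.shape[0])

    i = 0
    for pnt in pnt_cloud:
        xyz_pnt = np.dot(T, pnt)  # 3x3 matrix times a single 3-vector

        # Keep the first component only when it is positive
        if xyz_pnt[0] > 0:
            arr[i] = xyz_pnt[0]

        i += 1

    return arr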

Calling this function and timing it (with %time) gives the output:

Out[190]: CPU times: user 670 ms, sys: 7.91 ms, total: 678 ms
Wall time: 674 ms

Test with PyTorch tensors:

import torch

tensor_cld = torch.tensor(pnt_cloud)
tensor_T   = torch.tensor(T)

def transform(pnt_cloud, T):
    depth_array = torch.tensor(np.zeros(pnt_cloud.shape[0]))

    i = 0
    for pnt in pnt_cloud:
        xyz_pnt = torch.matmul(T, pnt)
        
        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]
            
        i += 1
            
        
    return depth_array

Calling this function and timing it (with %time) gives the output:

Out[199]: CPU times: user 6.15 s, sys: 28.1 ms, total: 6.18 s
Wall time: 6.09 s

Note: using torch.jit only cuts the run time by 2 seconds.

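Given the way PyTorch breaks its code down during the compilation stage, I would have expected the PyTorch tensor computation to be faster. What am I missing here?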

Is there a faster way to do this, other than using Numba?

Answer 1

Why are you using a for loop?
Why compute the full 3x3 matrix-vector product when you only use the first element of the result?

You can do all of the math in a single matmul:

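with torch.no_grad():
    # Only the first row of T is needed: (n x 3) @ (3 x 1) -> (n x 1)
    depth_array = torch.matmul(pnt_cloud, T[:1, :].T)
    # Keep only non-negative results; torch.maximum expects two tensors,
    # so clamp against the scalar 0 instead
    depth_array = torch.clamp(depth_array, min=0)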

Since you are comparing run time against NumPy, gradient tracking should be disabled, which is what torch.no_grad() does.
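The same single-product idea carries over to NumPy. The following is a minimal sketch, not the asker's actual pipeline: the function names and the random test data are made up for illustration. It checks that using only the first row of T and clamping at zero reproduces what the per-point loop from the question computes.

```python
import numpy as np

def transform_loop(pnt_cloud, T):
    """Reference version: the per-point loop from the question."""
    out = np.zeros(pnt_cloud.shape[0])
    for i, pnt in enumerate(pnt_cloud):
        x = np.dot(T, pnt)[0]      # full 3x3 product, first component only
        if x > 0:
            out[i] = x
    return out

def transform_vectorized(pnt_cloud, T):
    """One (n x 3) @ (3,) product with the first row of T, then clamp at zero."""
    return np.maximum(pnt_cloud @ T[0], 0.0)

# Hypothetical test data, much smaller than the 100,000 points in the question
rng = np.random.default_rng(0)
pnt_cloud = rng.standard_normal((1000, 3))
T = rng.standard_normal((3, 3))

assert np.allclose(transform_loop(pnt_cloud, T), transform_vectorized(pnt_cloud, T))
```

Note that this returns a 1-D array of length n, like the loop in the question, whereas the matmul above with T[:1, :].T yields an n x 1 result.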

Answer 2

Regarding the speed, I got the following reply on the PyTorch forums:

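Operations like these are generally rather expensive, because the overhead of creating a Tensor becomes very significant (this includes setting single elements); I think that is the main factor here. It is also why the JIT does not help much (it only removes the Python overhead) while Numba shines (there, for example, the assignment to depth_array[i] is just a memory write).
If PyTorch and NumPy use different BLAS backends, the speed of the matmul itself may also differ.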
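To see the per-element overhead described in that reply, one can time the loop against a single batched product. This is only a rough sketch under assumed conditions: random float64 data, a smaller point count than in the question, and hypothetical function names. Absolute timings will vary by machine and PyTorch build.

```python
import time
import torch

n = 10_000  # assumed size, smaller than the question's 100,000 to keep the run short
pnt_cloud = torch.randn(n, 3, dtype=torch.float64)
T = torch.randn(3, 3, dtype=torch.float64)

def transform_loop(pnt_cloud, T):
    depth = torch.zeros(pnt_cloud.shape[0], dtype=pnt_cloud.dtype)
    for i, pnt in enumerate(pnt_cloud):   # each pnt is a fresh 3-element tensor
        x = torch.matmul(T, pnt)[0]       # another tiny tensor per iteration
        if x > 0:
            depth[i] = x                  # single-element write
    return depth

def transform_batched(pnt_cloud, T):
    # One (n x 3) @ (3,) product, then clamp negatives to zero
    return torch.clamp(pnt_cloud @ T[0], min=0)

with torch.no_grad():
    t0 = time.perf_counter()
    slow = transform_loop(pnt_cloud, T)
    t1 = time.perf_counter()
    fast = transform_batched(pnt_cloud, T)
    t2 = time.perf_counter()

print(f"loop:    {t1 - t0:.4f} s")
print(f"batched: {t2 - t1:.4f} s")
print("same result:", torch.allclose(slow, fast))
```

The loop pays the tensor-creation and indexing cost once per point, while the batched version pays it once in total, which is the effect the forum reply points to.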

Why is looping over PyTorch tensors so slow (compared to NumPy)?
