Question

Here is an excerpt from a Jupyter notebook:

In [1]:

import torch, numpy as np, datetime
cuda = torch.device('cuda')

In [2]:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 349 ms

tensor(17.0374, device='cuda:0') tensor(17.0376, device='cuda:0')

The time is low but still reasonable (0.35 s for 1e12 multiplications).

But if we repeat the same cell:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 999 µs

tensor(-78.7172, device='cuda:0') tensor(-78.7173, device='cuda:0')

1e12 multiplications in 1 ms?!

Why did the time change from 349 ms to 1 ms?
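As a rough sanity check on these two timings, the implied arithmetic throughput can be computed directly. This is a minimal sketch: the helper name `implied_tflops` is made up for illustration, and it assumes the usual count of about 2·n³ floating-point operations for an n×n matrix product.

```python
# Back-of-the-envelope check: implied throughput of one n x n matrix product.
def implied_tflops(n: int, seconds: float) -> float:
    """Return the implied throughput in TFLOP/s for one n x n matmul."""
    flops = 2 * n ** 3  # n**3 multiplications plus roughly n**3 additions
    return flops / seconds / 1e12


first_run = implied_tflops(10_000, 349e-3)   # first cell: 349 ms
second_run = implied_tflops(10_000, 999e-6)  # repeated cell: 999 us

print(f"349 ms -> {first_run:.1f} TFLOP/s")
print(f"999 us -> {second_run:.0f} TFLOP/s")
```

The first figure comes out at a few TFLOP/s, which is plausible for a consumer GPU. The second is on the order of thousands of TFLOP/s, far beyond what such a card can deliver, so the 999 µs reading cannot be the time of the multiplication itself.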

Info:

Tested on a GeForce RTX 2070;
Can be reproduced on Google Colab.

Answer 1

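There is already a discussion about this on Discuss PyTorch: Measuring GPU tensor operation speed.

I'd like to highlight two comments from that thread:

From @apaszke:

[...] the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct

From @ngimel:

I believe cublas handles are allocated lazily now, which means that the first operation requiring cublas will have the overhead of creating the cublas handle, and that includes some internal allocations. So there's no way to avoid it other than calling some function requiring cublas before the timing loop.

Basically, you have to synchronize() to get a proper measurement: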

import torch

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 288 ms, sys: 191 ms, total: 479 ms

Wall time: 492 ms

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 237 ms, sys: 231 ms, total: 468 ms

Wall time: 469 ms
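The two quoted points (warm up first, then put a barrier around the timed region) can be folded into one small helper. This is a sketch under assumptions, not code from the thread: `timed_matmul` and its `warmup` parameter are names invented here, and the CPU fallback is included only so the snippet also runs on a machine without a GPU.

```python
import time

import torch


def timed_matmul(n: int, device: torch.device, warmup: int = 1) -> float:
    """Return wall-clock seconds for one n x n matmul, measured with barriers."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    def sync() -> None:
        # synchronize() is only needed (and only valid) on a CUDA device.
        if device.type == "cuda":
            torch.cuda.synchronize()

    # Warm-up: absorbs one-time costs such as lazy cuBLAS handle creation.
    for _ in range(warmup):
        torch.matmul(a, b)
    sync()  # make sure tensor creation and warm-up have finished

    start = time.perf_counter()
    torch.matmul(a, b)
    sync()  # wait for the kernel itself, not just for the launch
    return time.perf_counter() - start


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# n=10000 matches the question; a smaller n keeps the sketch quick on CPU.
elapsed = timed_matmul(1024, device)
print(f"{device.type}: {elapsed * 1e3:.2f} ms")
```

Calling `timed_matmul(10000, torch.device("cuda"))` several times in a row should then give similar readings on each call, rather than one slow call followed by near-zero ones.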

Answer 2

The docs say:

torch.cuda.synchronize()

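Waits for all kernels in all streams on a CUDA device to complete.

In effect, this tells Python: stop, and wait until the operation has fully completed.

Otherwise, %time returns immediately after the command has been issued.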
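The same effect can be mimicked without a GPU. What follows is only an analogy built on Python's standard-library thread pool, not on CUDA: `submit()` plays the role of a kernel launch and `result()` the role of `torch.cuda.synchronize()`. The `slow_kernel` function is a made-up stand-in that just sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_kernel(duration: float) -> str:
    """Stand-in for a GPU kernel: sleeps for `duration` seconds."""
    time.sleep(duration)
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Timing only the launch: submit() returns as soon as the work is queued.
    start = time.perf_counter()
    pending = pool.submit(slow_kernel, 0.2)
    launch_only = time.perf_counter() - start

    # Drain the queue first, otherwise the pending work would be counted
    # in the next measurement (this mirrors the first synchronize() call).
    pending.result()

    # Timing launch plus completion: result() blocks until the work is done.
    start = time.perf_counter()
    pool.submit(slow_kernel, 0.2).result()
    launch_and_wait = time.perf_counter() - start

print(f"launch only:     {launch_only * 1e3:.2f} ms")
print(f"launch and wait: {launch_and_wait * 1e3:.2f} ms")
```

The first reading is a tiny fraction of the 0.2 s the work actually takes, which is the same kind of mismatch as the 999 µs reading in the question.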

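The following is the correct way to measure the time. Note the two calls to torch.cuda.synchronize(): the first waits for the tensors to be moved to the CUDA device, and the second waits until the command has completed on the GPU.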

import torch

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
torch.cuda.synchronize()

%timeit -n 10 y = x.matmul(w.t()); torch.cuda.synchronize() #10 loops, best of 3: 531 ms per loop
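An alternative that is not used in the answers here is to time on the GPU side with CUDA events. A minimal sketch, assuming a CUDA-capable device is present (the function name `matmul_ms_with_events` is made up for illustration):

```python
import torch


def matmul_ms_with_events(n: int) -> float:
    """Return GPU-side milliseconds for one n x n matmul, using CUDA events."""
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.matmul(a, b)        # warm-up, absorbs one-time initialization costs
    torch.cuda.synchronize()  # finish setup before the timed region

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()  # both events must complete before elapsed_time()
    return start.elapsed_time(end)


if torch.cuda.is_available():
    print(f"{matmul_ms_with_events(10_000):.1f} ms")
else:
    print("CUDA not available; skipping the event-based measurement")
```

Because the events are recorded in the GPU stream, the reading covers the kernel's execution rather than the time Python needed to launch it.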

Answer 3

My guess is

GPU memory caching. Try calling torch.cuda.empty_cache() after each run.

How to multiply two 10000*10000 matrices in near-zero time with torch? Why does the time drop so sharply, from 349 ms to 999 µs?
