Why is PyTorch training much slower on CUDA than on the CPU?

Asked: 2019-06-08 19:23:43

Tags: performance pytorch

I think I must have messed something up in this simple neural network in PyTorch, because it runs much slower on CUDA than on the CPU. Can you find the mistake?

Using a function like
    def backward(ctx, input):

        return backward_sigm(ctx, input)

does not seem to make any real difference to the performance.

import torch
import torch.nn as nn
import torch.nn.functional as f


# choose the device here: 'cuda:0' to run on the GPU, or override with 'cpu'
dname = 'cuda:0'
dname = 'cpu'




device = torch.device(dname)


print(torch.version.cuda)

def forward_sigm(ctx, input):

    sigm = 1 / (1 + torch.exp(-input))

    ctx.save_for_backward(sigm)

    return sigm

def forward_step(ctx, input):

    return  torch.tensor(input > 0.5, dtype = torch.float32, device = device)


def backward_sigm(ctx, grad_output):

    sigm, = ctx.saved_tensors

    return grad_output * sigm * (1-sigm)


def backward_step(ctx, grad_output):

    return grad_output




class StepAF(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        return forward_sigm(ctx, input)


    @staticmethod
    def backward(ctx, input):

        return backward_sigm(ctx, input)
    #else return grad_output



class StepNN(torch.nn.Module):

    def __init__(self, input_size, hidden_size, output_size):
        super(StepNN, self).__init__()
        self.linear1 = torch.nn.Linear(input_size, hidden_size)
        #self.linear1.cuda()
        self.linear2 = torch.nn.Linear(hidden_size, output_size)
        #self.linear2.cuda()

        #self.StepAF = StepAF.apply



    def forward(self,x):

        h_line_1 = self.linear1(x)

        h_thrash_1 = StepAF.apply(h_line_1)

        h_line_2 = self.linear2(h_thrash_1)

        output = StepAF.apply(h_line_2)

        return output


inputs = torch.tensor( [[1,0,1,0],[1,0,0,1],[0,1,0,1],[0,1,1,0],[1,0,0,0],[0,0,0,1],[1,1,0,1],[0,1,0,0],], dtype = torch.float32, device = device)

expected = torch.tensor( [[1,0,0],[1,0,0],[0,1,0],[0,1,0],[1,0,0],[0,0,1],[0,1,0],[0,0,1],], dtype = torch.float32, device = device)


nn = StepNN(4,8,3)


#print(*(x for x in nn.parameters()))

criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(nn.parameters(), lr=1e-3)

steps = 50000

print_steps = steps // 20

good_loss = 1e-5

for t in range(steps):

    output = nn(inputs)
    loss = criterion(output, expected)



    if t % print_steps == 0:
        print('step ',t, ', loss :' , loss.item())

    if loss < good_loss:
        print('step ',t, ', loss :' , loss.item())
        break

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()



test = torch.tensor( [[0,1,0,1],[0,1,1,0],[1,0,1,0],[1,1,0,1],], dtype = torch.float32, device=device)


print(nn(test))

1 Answer:

Answer 0 (score: 1):

Unless you have sufficiently large data, you will not see any performance improvement from using a GPU. The problem is that GPUs rely on parallel processing, so unless you have a large amount of data, the CPU can process the samples almost as fast as the GPU can.
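
As a quick illustration of this point, here is a minimal timing sketch (my own, assuming a CUDA device is available) that measures a single matrix multiplication at a few sizes on both devices. torch.cuda.synchronize() is needed for a fair measurement because GPU kernels are launched asynchronously:

import time
import torch

def time_matmul(device, n, repeats=10):
    # Two random n x n matrices created directly on the target device.
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize()  # finish allocations before starting the clock
    start = time.time()
    for _ in range(repeats):
        c = a @ b  # result is not used; we only care about the time
    if device.type == 'cuda':
        torch.cuda.synchronize()  # GPU kernels are queued asynchronously
    return (time.time() - start) / repeats

for n in (8, 256, 4096):
    for dev in ('cpu', 'cuda:0'):
        print(f'{n}x{n} matmul on {dev}: {time_matmul(torch.device(dev), n):.6f} s')

For tiny matrices the CPU typically wins, and the GPU only pulls ahead once there is enough work per operation to keep its many cores busy.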

From what I can see in your example, you are using 8 samples of size (4, 1). I would imagine that you will only see the GPU improve things once you have hundreds or thousands of samples. In your case the sample size is (4, 1) and the hidden layer size is 8, so the CPU can perform the computations fairly quickly.
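
To check this for the exact shapes in the question, a sketch along these lines could be used. The 4 -> 8 -> 3 layer sizes mirror the question, but the batch sizes, step count, and the use of the built-in nn.Sigmoid instead of the custom autograd Function are just assumptions for the benchmark:

import time
import torch

def time_training(device, batch_size, steps=200):
    # Same layer sizes as in the question (4 -> 8 -> 3), built-in sigmoid
    # instead of the custom autograd Function, purely for timing purposes.
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 8),
        torch.nn.Sigmoid(),
        torch.nn.Linear(8, 3),
        torch.nn.Sigmoid(),
    ).to(device)
    criterion = torch.nn.MSELoss(reduction='sum')
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Random inputs/targets of the requested batch size, created on the device.
    x = torch.rand(batch_size, 4, device=device)
    y = torch.rand(batch_size, 3, device=device)

    if device.type == 'cuda':
        torch.cuda.synchronize()  # make sure setup work has finished
    start = time.time()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for all queued GPU kernels
    return time.time() - start

for batch_size in (8, 8192):
    for dev in ('cpu', 'cuda:0'):
        seconds = time_training(torch.device(dev), batch_size)
        print(f'batch {batch_size:5d} on {dev}: {seconds:.3f} s')

With a batch of 8 the CPU will usually finish the steps faster, while with thousands of samples per step the GPU should pull ahead.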

There are plenty of example notebooks online that use MNIST data (which has about 60,000 images for training), so you could load one in Google Colab, try training on the CPU and then on the GPU, and compare the training times. You could try this link, for example. It uses TensorFlow instead of PyTorch, but it will give you an idea of the performance gain a GPU offers.
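
If you would rather stay in PyTorch, a rough equivalent of that experiment might look like the sketch below. It assumes torchvision is installed so MNIST can be downloaded; the network and hyperparameters are arbitrary and only serve the timing comparison:

import time
import torch
import torchvision
from torchvision import transforms

def train_one_epoch(device):
    # MNIST: roughly 60,000 training images of size 28x28.
    dataset = torchvision.datasets.MNIST(
        root='./data', train=True, download=True,
        transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)

    # A small fully connected classifier, only used for the timing comparison.
    model = torch.nn.Sequential(
        torch.nn.Linear(28 * 28, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    start = time.time()
    for images, labels in loader:
        # Flatten the 28x28 images and move the batch to the target device.
        images = images.view(images.size(0), -1).to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    if device.type == 'cuda':
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return time.time() - start

for dev in ('cpu', 'cuda:0'):
    print(dev, ':', round(train_one_epoch(torch.device(dev)), 1), 'seconds')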

Note: if you have not used Google Colab before, you need to change the runtime type in the Runtime menu at the top (None for the CPU, GPU for the GPU).

Also, I'll post the results from that notebook here (look at the times mentioned in brackets; if you run it yourself, you can see directly how fast it goes):

On the CPU:

INFO:tensorflow:loss = 294.3736, step = 1
INFO:tensorflow:loss = 28.285727, step = 101 (23.769 sec)
INFO:tensorflow:loss = 23.518856, step = 201 (24.128 sec)

On the GPU:

INFO:tensorflow:loss = 295.08328, step = 0
INFO:tensorflow:loss = 47.37291, step = 100 (4.709 sec)
INFO:tensorflow:loss = 23.31364, step = 200 (4.581 sec)
INFO:tensorflow:loss = 9.980572, step = 300 (4.572 sec)
INFO:tensorflow:loss = 17.769928, step = 400 (4.560 sec)
INFO:tensorflow:loss = 16.345463, step = 500 (4.531 sec)