Question

After wrapping my model in DataParallel:

model20 = FourLayerConvNetWithPool()
model20 = nn.DataParallel(model20)

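I can see utilization on all available GPUs. On closer inspection, however, utilization of "cuda:0" is still higher than the others, so it is a bottleneck (though probably a small one for large networks).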

Here is a snippet of the training iteration:

model.train()  # put model to training mode
x = x.to(device=self.device, dtype=self.dtype)  # move to device, e.g. GPU
y = y.to(device=self.device, dtype=torch.long)

scores = model(x)
loss = F.cross_entropy(scores, y)

if reg > 0:
    # Create the accumulator directly on the target device
    l2_regularization = torch.zeros((), device=self.device, dtype=self.dtype)
    for param in model.parameters():
        l2_regularization += torch.norm(param, 2)
    loss += reg * l2_regularization

# Zero out all of the gradients for the variables which the optimizer
# will update.
optimizer.zero_grad()

# This is the backwards pass: compute the gradient of the loss with
# respect to each  parameter of the model.
loss.backward()

# Actually update the parameters of the model using the gradients
# computed by the backwards pass.
optimizer.step()

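By commenting out different parts of the snippet above, I found that F.cross_entropy and torch.norm contribute to the difference between "cuda:0" and "cuda:1". This makes sense, since these parts are not parallelized: they run only on "cuda:0", after DataParallel has gathered the outputs.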

I know I can move F.cross_entropy into the model itself (into its forward method), so that it runs on each GPU.
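For reference, this is roughly what I mean by moving the loss into the model. It is only a minimal sketch: `ModelWithLoss` is a hypothetical wrapper name, and the small `nn.Sequential` network is a stand-in for my `FourLayerConvNetWithPool`, which is not shown here. The wrapper returns one loss value per sample so that DataParallel gathers a 1-D tensor, which is then averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: runs the wrapped model and computes the loss in forward()."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x, y):
        scores = self.model(x)
        # One loss value per sample, so each replica returns a 1-D tensor
        # that DataParallel can concatenate along the batch dimension.
        return F.cross_entropy(scores, y, reduction="none")


# DataParallel expects the module on the first GPU; fall back to CPU if none exists.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in network (assumption); replace with the real model.
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).to(device)
wrapped = nn.DataParallel(ModelWithLoss(net))

x = torch.randn(16, 3, 8, 8, device=device)
y = torch.randint(0, 10, (16,), device=device)

per_sample_loss = wrapped(x, y)   # shape: (batch_size,)
loss = per_sample_loss.mean()     # same value as the default mean-reduced cross entropy
loss.backward()
```

With this layout, both `x` and `y` are scattered across the GPUs, so the cross-entropy is computed on each replica for its own slice of the batch. Only the final mean runs on "cuda:0".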

But what about the L2 regularization loss?

How can I parallelize the regularization loss computation across multiple GPUs in PyTorch?

0 answers: