I'm having some trouble getting my system set up correctly. My system consists of:
lspci correctly detects my GPU:
06:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
Subsystem: ASUSTeK Computer Inc. Device 85ea
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 128 bytes
Interrupt: pin A routed to IRQ 18
Region 0: Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at a0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at b0000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 2000 [size=128]
Expansion ROM at c5000000 [disabled] [size=512K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
With the commands below, I installed the drivers for the GPU, CUDA, and cuDNN:
# Show thunderbolt port / authorize eGPU
$ cat /sys/bus/thunderbolt/devices/0-1/device_name
$ echo 1 | sudo tee -a /sys/bus/thunderbolt/devices/0-1/authorized
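The authorization state of every Thunderbolt device can be checked in one pass; a minimal sketch, assuming the standard sysfs layout (the 0-1 path above is just the first port):

```shell
# List each Thunderbolt device and whether it is authorized (sysfs layout assumed).
found=0
for d in /sys/bus/thunderbolt/devices/*/; do
    [ -f "${d}authorized" ] || continue
    found=1
    name=$(cat "${d}device_name" 2>/dev/null || basename "$d")
    printf '%s: authorized=%s\n' "$name" "$(cat "${d}authorized")"
done
[ "$found" -eq 1 ] || echo "no thunderbolt devices visible"
```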
# eGPU - on Ubuntu 16.04 - nvidia-384
$ sudo ubuntu-drivers devices
$ sudo ubuntu-drivers autoinstall
$ sudo apt-get install nvidia-modprobe
# CUDA - Download CUDA from Nvidia - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run
$ chmod +x cuda_9.0.176_384.81_linux-run
$ ./cuda_9.0.176_384.81_linux-run --extract=$HOME
$ sudo ./cuda-linux.9.0.176-22781540.run
$ sudo ./cuda-samples.9.0.176-22781540-linux.run
$ sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
$ sudo ldconfig
# Append :/usr/local/cuda/bin (including the ":") to the end of the
# PATH="/blah:/blah/blah" string (inside the quotes)
$ sudo nano /etc/environment
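The PATH edit above can also be scripted instead of done by hand; a sketch of the sed invocation, demonstrated on a sample file rather than on /etc/environment itself:

```shell
# Demonstrate the edit on a sample copy first; run the same sed (with sudo)
# on /etc/environment once the output looks right.
printf 'PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"\n' > /tmp/environment.sample
# Capture everything up to the closing quote, then re-emit it with the CUDA bin dir appended.
sed -i 's|^\(PATH=".*\)"$|\1:/usr/local/cuda/bin"|' /tmp/environment.sample
cat /tmp/environment.sample
```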
# CUDA samples - Check installation
$ cd /usr/local/cuda-9.0/samples
$ sudo make
$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
# cuDNN
$ wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb
$ sudo dpkg -i libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb
# Authorization (Security risk!)
$ sudo nano /etc/udev/rules.d/99-local.rules
# Add
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
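The rule can be staged and verified before touching /etc/udev; a sketch assuming the standard udev paths (once installed, reloading udev makes it take effect without a reboot):

```shell
# Write the rule to /tmp first so it can be inspected before installing it.
cat > /tmp/99-local.rules <<'EOF'
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
EOF
cat /tmp/99-local.rules
# Then, to install and activate it without rebooting:
#   sudo cp /tmp/99-local.rules /etc/udev/rules.d/
#   sudo udevadm control --reload-rules && sudo udevadm trigger
```

Note the operators in the rule: `==` is a match (only fire for unauthorized thunderbolt devices), while the final `=` is an assignment that flips `authorized` to 1.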
Using sudo prime-select nvidia I can switch between the eGPU and the Intel integrated GPU. After logging out and back in, the eGPU appears to work (nvidia-smi):
Fri Sep 21 09:25:18 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:06:00.0 Off | N/A |
| 0% 47C P0 84W / 275W | 214MiB / 11172MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6734 G /usr/lib/xorg/Xorg 138MiB |
| 0 7081 G kwin_x11 23MiB |
| 0 7084 G /usr/bin/krunner 2MiB |
| 0 7092 G /usr/bin/plasmashell 47MiB |
+-----------------------------------------------------------------------------+
The CUDA samples work as well:
$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080 Ti"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11172 MBytes (11715084288 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1683 MHz (1.68 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
Now to the actual problem: the following snippet only works when I run Python with root privileges.
$ python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.Tensor([0]).cuda()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: unknown error
vs.
$ sudo ~/miniconda3/envs/PyTorch/bin/python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.Tensor([0]).cuda()
tensor([0.], device='cuda:0')
TL;DR: My CUDA installation appears to work, and my eGPU is correctly recognized by the system. But in PyTorch (and in TensorFlow as well), moving data to the eGPU throws an "unknown error", even though the framework detects the device. When I run the same code with root privileges, everything works like a charm. Does anyone know which permissions I need to adjust so that I can run the code as a standard user?
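One thing worth checking (an assumption, not confirmed by the logs above): CUDA failing with "unknown error" for non-root users is often caused by the /dev/nvidia* device nodes, in particular /dev/nvidia-uvm, being missing or only created when root initializes the driver first; nvidia-modprobe (installed earlier) is the setuid helper meant to create them for unprivileged users. A diagnostic sketch:

```shell
# Check whether the nvidia device nodes exist and are world-accessible.
ls -l /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* nodes present"
# If /dev/nvidia-uvm is missing, this may create it as a normal user
# (requires the nvidia-modprobe setuid helper):
#   nvidia-modprobe -u -c=0
```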
Answer 0 (score: 0)
After trying different versions of the nvidia driver, I came to the conclusion that version 384 was the problem.
I removed everything related to the nvidia driver via sudo apt-get purge nvidia* and installed version 390:
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo ubuntu-drivers devices
$ sudo apt install nvidia-390
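After the reinstall (and a reboot), it's worth confirming that the loaded kernel module actually reports 390 rather than a leftover 384; a sketch assuming the driver's standard /proc interface:

```shell
# Print the version of the nvidia kernel module actually loaded.
if [ -r /proc/driver/nvidia/version ]; then
    cat /proc/driver/nvidia/version
else
    echo "nvidia kernel module not loaded"
fi
```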
The result:
(PyTorch) max@MaxKubuntuNUC:~$ python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.tensor([1]).cuda()
tensor([1], device='cuda:0')
Now everything works like a charm.