Question

I am new to TensorFlow. I tried running the cifar10 example from here: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10

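I made no changes to the code; I am only trying to run it on multiple GPUs. I am using 6 GPUs and allocating 10 GB of RAM to my job, but after a few minutes the job fails because it hits the memory limit. Allocating more memory does not help; it only delays the error. I have tried up to 40 GB.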
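One way to confirm that host memory keeps growing over time, rather than simply having a large fixed footprint, would be to log the process's resident set size at regular intervals. The following is a minimal sketch, not part of the tutorial code: the helper names `parse_vmrss_kb` and `current_rss_kb` are hypothetical, and it assumes a Linux system where `/proc/self/status` contains a `VmRSS` line reported in kB.

```python
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) found in /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def current_rss_kb(pid="self"):
    """Read resident memory of a process from /proc (assumption: Linux only)."""
    with open("/proc/{}/status".format(pid)) as status_file:
        return parse_vmrss_kb(status_file.read())


if __name__ == "__main__":
    # Print the resident memory of this Python process once.
    print("resident memory (kB):", current_rss_kb())
```

Calling `current_rss_kb()` every few hundred training steps and printing the result would show whether host memory rises steadily, which is what a leak would look like.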

Here is more information about my system:

== cat /etc/issue ===============================================
Linux mmmdgx01 4.4.0-45-generic #66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
DGX_OTA_VERSION=2.0.5
VERSION="14.04.5 LTS, Trusty Tahr"
VERSION_ID="14.04"

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux mmmdgx01 4.4.0-45-generic #66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.11.1)
protobuf (3.2.0)
tensorflow (1.1.0rc1)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.1.0-rc1
tf.GIT_VERSION = v1.1.0-rc1-272-gf77f19b
tf.COMPILER_VERSION = v1.1.0-rc1-272-gf77f19b
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /opt/sw/cuda/8.0/lib64/:/project/DGX/cuda/lib64/:/opt/sw/cuda/8.0/extras/CUPTI/lib64/:/project/DGX/lib
DYLD_LIBRARY_PATH /project/DGX/torch/install/lib:/project/torch7new/install/lib:

== nvidia-smi ===================================================
Fri May 12 15:46:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20                 Driver Version: 375.20                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 0000:06:00.0     Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 0000:07:00.0     Off |                    0 |
| N/A   32C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 0000:0A:00.0     Off |                    0 |
| N/A   34C    P0    33W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 0000:0B:00.0     Off |                    0 |
| N/A   33C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  On   | 0000:85:00.0     Off |                    0 |
| N/A   33C    P0    30W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  On   | 0000:86:00.0     Off |                    0 |
| N/A   33C    P0    33W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  On   | 0000:89:00.0     Off |                    0 |
| N/A   31C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  On   | 0000:8A:00.0     Off |                    0 |
| N/A   35C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs ====================================================

Here is my job submission script:

#! /bin/bash
#SBATCH --account=AI
#SBATCH --time=167:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH -J TFImgNet
#SBATCH -e tf.err
#SBATCH -o tf.log
#SBATCH --mem=10960
#SBATCH --gres=gpu:6
cpath=$(pwd)
cd ~
source .bashrc
cd $cpath
which python
python cifar10_multi_gpu_train.py --num_gpus 6

Here is the error:

2017-05-12 15:14:07.162709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3 4 5
2017-05-12 15:14:07.162718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y Y Y Y N
2017-05-12 15:14:07.162721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y Y Y N Y
2017-05-12 15:14:07.162724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   Y Y Y Y N N
2017-05-12 15:14:07.162727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   Y Y Y Y N N
2017-05-12 15:14:07.162729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 4:   Y N N N Y Y
2017-05-12 15:14:07.162732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 5:   N Y N N Y Y
2017-05-12 15:14:07.162743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
2017-05-12 15:14:07.162747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
2017-05-12 15:14:07.162751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:0a:00.0)
2017-05-12 15:14:07.162754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:0b:00.0)
2017-05-12 15:14:07.162756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
2017-05-12 15:14:07.162759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla P100-SXM2-16GB, pci bus id: 0000:86:00.0)
slurmstepd: error: Job 1313520 exceeded memory limit (11240536 > 11223040), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 1313520 ON mmmdgx01 CANCELLED AT 2017-05-12T15:28:58 ***
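As a sanity check on the numbers in the log above, the limit reported by slurmstepd can be compared with the `--mem` request in the submission script. This is only a sketch under two assumptions: that Slurm reads `--mem=10960` in megabytes (its default unit) and that slurmstepd reports both values in kilobytes. If both hold, the limit is host RAM for the whole job and not GPU memory, and the job was killed as soon as it went a few megabytes over.

```python
# Values copied from the submission script and the slurmstepd error line.
MEM_REQUEST_MB = 10960   # "#SBATCH --mem=10960" (assumed to be megabytes)
LIMIT_KB = 11223040      # limit reported by slurmstepd (assumed kilobytes)
USED_KB = 11240536       # usage reported by slurmstepd (assumed kilobytes)


def mb_to_kb(megabytes):
    """Convert megabytes to kilobytes using 1 MB = 1024 kB."""
    return megabytes * 1024


# If the unit assumptions hold, the request and the reported limit match exactly.
limit_from_request_kb = mb_to_kb(MEM_REQUEST_MB)
overshoot_kb = USED_KB - LIMIT_KB

if __name__ == "__main__":
    print("limit derived from --mem (kB):", limit_from_request_kb)
    print("matches reported limit:", limit_from_request_kb == LIMIT_KB)
    print("overshoot when killed (kB):", overshoot_kb)
```

The request and the reported limit agree, which supports the unit assumptions. The usage figure is the job's host memory at the moment it was cancelled.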

TensorFlow CIFAR-10 example memory leak

0 answers: