TensorFlow CIFAR-10 example memory leak

Asked: 2017-05-12 21:50:51

Tags: memory-leaks tensorflow

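I am new to TensorFlow. I tried running the CIFAR-10 example from here: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10

I made no changes to the code; I am only trying to run it on multiple GPUs. I am using 6 GPUs and allocating 10 GB of RAM to my job, but after a few minutes the job fails because it hits the memory limit. Allocating more memory does not help; it only delays the error. I have tried up to 40 GB.

Here is more information about my system: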
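One way to check whether host memory really grows step after step is to log the process's peak resident set size from inside the training loop. The following is only a minimal sketch using Python's standard `resource` module; the helper names `peak_rss_mib` and `log_peak_rss` are hypothetical and are not part of the CIFAR-10 example. It also measures only the Python process itself, whereas the scheduler may count every process in the job.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def log_peak_rss(step, every=100):
    """Print peak host memory every `every` steps (hypothetical helper).

    Returns the measured value in MiB, or None when the step is skipped.
    """
    if step % every != 0:
        return None
    mib = peak_rss_mib()
    print("step %d: peak RSS %.1f MiB" % (step, mib))
    return mib
```

Calling `log_peak_rss(step)` once per training step would show whether the peak keeps climbing. Because it is a peak value it never decreases, so a flat reading after warm-up would argue against a leak in this process.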

  

== cat /etc/issue ===============================================
Linux mmmdgx01 4.4.0-45-generic #66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
DGX_OTA_VERSION=2.0.5
VERSION="14.04.5 LTS, Trusty Tahr"
VERSION_ID="14.04"

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux mmmdgx01 4.4.0-45-generic #66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.11.1)
protobuf (3.2.0)
tensorflow (1.1.0rc1)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.1.0-rc1
tf.GIT_VERSION = v1.1.0-rc1-272-gf77f19b
tf.COMPILER_VERSION = v1.1.0-rc1-272-gf77f19b
Sanity check: array(1, dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /opt/sw/cuda/8.0/lib64/:/project/DGX/cuda/lib64/:/opt/sw/cuda/8.0/extras/CUPTI/lib64/:/project/DGX/lib
DYLD_LIBRARY_PATH /project/DGX/torch/install/lib:/project/torch7new/install/lib:

     

== nvidia-smi ===================================================
Fri May 12 15:46:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20                 Driver Version: 375.20                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 0000:06:00.0     Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 0000:07:00.0     Off |                    0 |
| N/A   32C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 0000:0A:00.0     Off |                    0 |
| N/A   34C    P0    33W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 0000:0B:00.0     Off |                    0 |
| N/A   33C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  On   | 0000:85:00.0     Off |                    0 |
| N/A   33C    P0    30W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  On   | 0000:86:00.0     Off |                    0 |
| N/A   33C    P0    33W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  On   | 0000:89:00.0     Off |                    0 |
| N/A   31C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  On   | 0000:8A:00.0     Off |                    0 |
| N/A   35C    P0    32W / 300W |      0MiB / 16308MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

== cuda libs ====================================================

Here is my job submission script:

#! /bin/bash
#SBATCH --account=AI
#SBATCH --time=167:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH -J TFImgNet
#SBATCH -e tf.err
#SBATCH -o tf.log
#SBATCH --mem=10960
#SBATCH --gres=gpu:6
cpath=$(pwd)
cd ~
source .bashrc
cd $cpath
which python
python cifar10_multi_gpu_train.py --num_gpus 6

Here is the error:

2017-05-12 15:14:07.162709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3 4 5
2017-05-12 15:14:07.162718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y Y Y Y N
2017-05-12 15:14:07.162721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y Y Y N Y
2017-05-12 15:14:07.162724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   Y Y Y Y N N
2017-05-12 15:14:07.162727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   Y Y Y Y N N
2017-05-12 15:14:07.162729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 4:   Y N N N Y Y
2017-05-12 15:14:07.162732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 5:   N Y N N Y Y
2017-05-12 15:14:07.162743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
2017-05-12 15:14:07.162747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:07:00.0)
2017-05-12 15:14:07.162751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla P100-SXM2-16GB, pci bus id: 0000:0a:00.0)
2017-05-12 15:14:07.162754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla P100-SXM2-16GB, pci bus id: 0000:0b:00.0)
2017-05-12 15:14:07.162756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
2017-05-12 15:14:07.162759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla P100-SXM2-16GB, pci bus id: 0000:86:00.0)
slurmstepd: error: Job 1313520 exceeded memory limit (11240536 > 11223040), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 1313520 ON mmmdgx01 CANCELLED AT 2017-05-12T15:28:58 ***
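
The numbers in the slurmstepd message can be cross-checked against the submission script. The sketch below assumes that `--mem` is interpreted in megabytes (the Slurm default unit) and that slurmstepd reports usage in kilobytes; under those assumptions the reported limit is exactly the requested 10960 MB, and the job was killed roughly 15 minutes after the GPU devices were created.

```python
from datetime import datetime

requested_mb = 10960    # from "#SBATCH --mem=10960"
limit_kb = 11223040     # limit printed by slurmstepd
used_kb = 11240536      # usage printed by slurmstepd

# Assumption: --mem is in MB and slurmstepd prints KB.
limit_matches_request = (requested_mb * 1024 == limit_kb)
overage_kb = used_kb - limit_kb

# Timestamps copied from the log above.
devices_created = datetime(2017, 5, 12, 15, 14, 7)
job_cancelled = datetime(2017, 5, 12, 15, 28, 58)
elapsed = job_cancelled - devices_created

print("limit matches --mem:", limit_matches_request)
print("overage: %d KB" % overage_kb)
print("time until kill: %s" % elapsed)
```

So the job was not killed by a GPU out-of-memory condition: it was the host RAM limit set through `--mem` that was exceeded, by about 17 MB at the moment of the check.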

0 answers:

No answers