使用cuda-9.0进行Theano分段故障

时间:2017-08-11 02:16:28

标签: theano theano-cuda

我尝试在P100节点上安装和使用Theano和Cuda-9.0。安装本身很顺利,但我得到了分段错误(见下文)。

我尝试将Theano-0.9.0和Theano-0.10.0beta1与libgpuarray / pygpu-0.6.8和0.6.9结合使用。所有这些案件都会导致段错误。

这是我的设置: * RHEL 7 *海湾合作委员会:4.8.5 * CUDA:9.0 * cuDNN:5.1.5 * Python:2.7.13 * cmake:3.7.2

[bsankara@c460 ~]$ python
Python 2.7.13 (default, Aug 10 2017, 07:33:11)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import theano
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[52508,1],0] (PID 3946)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[c460:03946] *** Process received signal ***
[c460:03946] Signal: Segmentation fault (11)
[c460:03946] Signal code: Invalid permissions (2)
[c460:03946] Failing at address: 0x3fff8d48f5b0
[c460:03946] [ 0] [0x3fff9cdf0478]
[c460:03946] [ 1] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(load_libcuda+0x60)[0x3fff8631b5e0]
[c460:03946] [ 2] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x3f384)[0x3fff862df384]
[c460:03946] [ 3] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x41118)[0x3fff862e1118]
[c460:03946] [ 4] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(gpucontext_init+0x90)[0x3fff862c7930]
[c460:03946] [ 5] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x2c974)[0x3fff8638c974]
[c460:03946] [ 6] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x101050)[0x3fff9cc61050]
[c460:03946] [ 7] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x54318)[0x3fff863b4318]
[c460:03946] [ 8] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x56530)[0x3fff863b6530]
[c460:03946] [ 9] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554]
[c460:03946] [10] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8e64)[0x3fff9ccc9484]
[c460:03946] [11] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [12] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524]
[c460:03946] [13] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [14] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524]
[c460:03946] [15] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [16] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484]
[c460:03946] [17] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960]
[c460:03946] [18] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x188e50)[0x3fff9cce8e50]
[c460:03946] [19] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18ad54)[0x3fff9ccead54]
[c460:03946] [20] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18a540)[0x3fff9ccea540]
[c460:03946] [21] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x2f4)[0x3fff9cceb7b4]
[c460:03946] [22] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x15d038)[0x3fff9ccbd038]
[c460:03946] [23] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554]
[c460:03946] [24] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyObject_Call+0x74)[0x3fff9cbc1ab4]
[c460:03946] [25] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x68)[0x3fff9ccbfc68]
[c460:03946] [26] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3214)[0x3fff9ccc3834]
[c460:03946] [27] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360]
[c460:03946] [28] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484]
[c460:03946] [29] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960]
[c460:03946] *** End of error message ***
Segmentation fault

任何帮助将不胜感激。感谢。

2 个答案:

答案 0 :(得分:0)

从网上获取一个演示mpi c ++或c代码,并用mpicc / mpic ++编译它。检查编译器是否正常工作以及您创建的可执行文件是否可以运行,并且可以管理群集中不同节点之间的点对点通信。

您可能使用了错误的mpicc来编译theano,并且该编译器与inifiniband(或连接群集中的计算机的任何硬件)的库没有二进制兼容性。

例如,如果InfiniBand库是由gcc编译的,而theano是由基于intel编译器的mpicc编译的,那么它将无法工作。

您可以设置一个环境变量来要求openmpi的mpicc使用另一个编译器。

如果您在该计算机上有不同编译器编译的多个mpi实现...尝试使用ldd找出哪个共享库对象(那些.so文件)取决于哪个。

最好的情况当然是使用相同的编译器和相同的mpi包装器来编译器编译所有内容,并将文件包装成几个modules

答案 1 :(得分:0)

答案变成了gcc版本和libgpuarray。出于某种原因,gcc-4.8.5存在libgpuarray的问题,以及导致分段错误的原因。

我在我的用户空间中安装了gcc-5.4.0并重新编译了cmake和libgpuarray以及包括theano和numpy在内的其他内容(只是为了确定)然后它不再有Segmentation故障。

另一个变化是群集管理员使用新驱动程序384.66将CUDA更新为9.0.151

相关问题