MPI error with a large number of processes

Date: 2017-01-05 16:59:24

Tags: c mpi openmpi

I have a simple MPI program. Everything works fine for small numbers of processes (10, 100, and 1000), but with 10 000 processes I get a runtime error.

I tested my code on a 4-node cluster of the grid5000 platform (per node: 2 Intel Xeon E5-2630 v3 CPUs, 8 cores/CPU, 126 GB RAM, 5x558 GB HDD, 186 GB SSD, 10 Gbps Ethernet). The file mpi.c contains the following code:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myRank, numProcs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    /* Only rank 0 prints the communicator size and its own rank. */
    if (myRank == 0) printf("%d -%d\n", numProcs, myRank);

    MPI_Finalize();
    return 0;
}
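The exact launch command is not shown above; the following is an assumed, typical way to build and start the program on this cluster (the hostfile name hosts is a placeholder). With 4 nodes of 16 cores each there are only 64 physical cores, so 10 000 ranks heavily oversubscribe the machines, which must be allowed either through enough slots in the hostfile or, on recent Open MPI releases, the --oversubscribe flag:

# assumed build and launch commands (not part of the original question)
mpicc -O2 -o mpi mpi.c

# 10 000 ranks on 64 cores: the hostfile "hosts" must declare enough slots,
# or --oversubscribe can be added on newer Open MPI versions
mpirun -np 10000 --hostfile hosts ./mpi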

The following text is excerpted from the error message for n = 10 000:

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
[grimoire-1.nancy.grid5000.fr][[53314,1],3876][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:1504:init_one_device] error obtaining device context for mlx4_0 errno says Cannot allocate memory
[grimoire-1.nancy.grid5000.fr][[53314,1],348][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:1504:init_one_device] error obtaining device context for mlx4_0 errno says Cannot allocate memory
.
CMA: unable to open RDMA device
CMA: unable to open RDMA device
CMA: unable to open RDMA device
CMA: unable to open RDMA device
.
.
[grimoire-8.nancy.grid5000.fr][[53314,1],9995][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:1504:init_one_device] error obtaining device context for mlx4_0 errno says Cannot allocate memory
[grimoire-1.nancy.grid5000.fr:17190] 1209 more processes have sent help message help-mpi-btl-openib.txt / error in device init
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[53314,1],1560]) is on host: grimoire-1.nancy.grid5000.fr
  Process 2 ([[53314,1],1]) is on host: grimoire-4
  BTLs attempted: openib self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[grimoire-1.nancy.grid5000.fr:17584] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 1560 with PID 17584 on
node grimoire-1.nancy.grid5000.fr exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[grimoire-1.nancy.grid5000.fr:17999] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
.
[grimoire-1.nancy.grid5000.fr:18802] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[grimoire-1.nancy.grid5000.fr:17190] 23 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[grimoire-1.nancy.grid5000.fr:17190] 5 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
[grimoire-1.nancy.grid5000.fr:17190] 18 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
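The failing calls above are in the openib BTL: init_one_device cannot obtain a device context for mlx4_0 ("Cannot allocate memory"), apparently because roughly 2 500 ranks per node each try to initialize the InfiniBand device. As a sketch only (a commonly suggested Open MPI workaround, not verified for this setup), the openib BTL can be taken out of the transport list at launch so the ranks fall back to shared memory and TCP:

# sketch: restrict the BTL list so openib is never initialized
mpirun -np 10000 --hostfile hosts --mca btl self,sm,tcp ./mpi

# equivalent form: exclude only openib and keep the remaining BTLs
mpirun -np 10000 --hostfile hosts --mca btl ^openib ./mpi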

Thank you.

0 answers:

No answers