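Setting BTL flags in OpenMPI

Time: 2013-12-07 04:56:07

Tags: amazon-web-services mpi cluster-computing openmpi

I am trying to run a simple helloworld test against my OpenMPI installation. I set up a two-node cluster on Amazon AWS, running SUSE SLES11 SP3 with OpenMPI 1.4.4 (a bit old, but my Linux distribution has no newer binaries). I have reached the last step and am having trouble setting the correct btl flags.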

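Here is what I can do:

  • I can scp between the nodes in both directions, so passwordless SSH is up and running.

  • Running iptables -L shows no firewall rules, so I assume communication between the nodes should work.

  • I can compile my helloworld.c program with mpicc, and I have confirmed that it runs correctly on another working cluster, so I believe the local paths are set correctly and the program itself works.

  • If I execute mpirun from the master node, using only the master node, helloworld executes correctly: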

    ip-xxx-xxx-xxx-133: # mpirun -n 1 -host master --mca btl sm,openib,self ./helloworldmpi
    ip-xxx-xxx-xxx-133: hello world from process 0 of 1
    
  • If I execute mpirun from the master node, using only the worker node, helloworld also executes correctly:

    ip-xxx-xxx-xxx-133: # mpirun -n 1 -host node001 --mca btl sm,openib,self ./helloworldmpi
    ip-xxx-xxx-xxx-210: hello world from process 0 of 1
    

Now, my problem is that if I try to run helloworld on both nodes, I get an error:

ip-xxx-xxx-xxx-133: # mpirun -n 2 -host master,node001 --mca btl openib,self ./helloworldmpi
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5228,1],0]) is on host: ip-xxx-xxx-xxx-133
  Process 2 ([[5228,1],1]) is on host: node001
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-133:7037] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 7037 on
node ip-xxx-xxx-xxx-133 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-210:5838] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[ip-xxx-xxx-xxx-133:07032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure

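Finally, if I leave out the -mca btl sm,openib,self flag, nothing works at all. I admit my understanding of these flags is close to zero, and there is very little information online about how to use them. I looked at my data.conf file, and I am not sure whether all the devices listed there actually exist, but the -mca flag seems to resolve most of the problems, since I can at least execute on each node in the cluster individually. Any pointers on what I might be doing wrong, or where I might look, would be greatly appreciated.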

2 Answers:

Answer 0: (score: 3)

"--mca btl openib,sm,self" tells Open MPI which transports to use for MPI traffic. You specified:

  • openib: InfiniBand or iWARP
  • sm: shared memory
  • self: loopback

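As far as I know (though I have not followed AWS closely), AWS does not have InfiniBand or iWARP, so specifying openib here is useless. If you add "tcp" to the comma-separated list, Open MPI should use TCP, which is what you want. Specifically: "--mca btl tcp,sm,self" (the order of the comma-separated list does not matter).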
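The suggestion above can be sketched as a small shell snippet. The host names and binary name are the ones from the question; the variable names (BTL_LIST, HOSTS, CMD) are only illustrative. It prints the command rather than running it, so the flags can be checked first.

```shell
#!/bin/sh
# Transports: tcp between nodes, sm within a node, self for loopback.
BTL_LIST="tcp,sm,self"

# Host names and binary are taken from the question; adjust as needed.
HOSTS="master,node001"

# Assemble the mpirun invocation (illustrative variable names).
CMD="mpirun -n 2 -host ${HOSTS} --mca btl ${BTL_LIST} ./helloworldmpi"

# Print the command instead of executing it.
echo "$CMD"
```

Running the printed command on the master node should then use TCP between the two nodes, assuming the tcp BTL is available in the installation.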

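That said, by default Open MPI should effectively pick sm, tcp, and self on its own, so you should not need to specify "--mca btl tcp,sm,self" at all. It is a bit strange to me that this does not work for you.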
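One way to investigate why the defaults fail is to check whether the tcp BTL component is present in the installation at all. The following is a rough sketch, assuming ompi_info is on the PATH and prints one "MCA btl: <name>" line per component; the variable names are illustrative.

```shell
#!/bin/sh
# Collect the BTL component lines reported by ompi_info, if it is installed.
if command -v ompi_info >/dev/null 2>&1; then
    BTL_COMPONENTS=$(ompi_info | grep "MCA btl" || true)
else
    BTL_COMPONENTS=""
    echo "ompi_info not found on PATH" >&2
fi

# Assumption: each component appears as "MCA btl: <name> (...)".
case "$BTL_COMPONENTS" in
    *"btl: tcp"*) TCP_STATUS="present" ;;
    *)            TCP_STATUS="missing" ;;
esac

echo "tcp BTL: $TCP_STATUS"
```

If the result is "missing", the distribution package may have been built without the tcp component, which would be one possible explanation for why listing BTLs explicitly was needed.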

Answer 1: (score: 0)

For the record, I just needed to add tcp to the -mca btl flag, and it now works fine.
