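How do I increase the OpenFabrics memory limit for Torque jobs?

Time: 2013-07-19 21:02:07

Tags: linux mpi ulimit torque ofed

When I run an MPI job over InfiniBand, I get the following warning. We use the Torque resource manager.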

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:              host1

Registerable memory:     65536 MiB

Total memory:            196598 MiB

Your MPI job will continue, but may be behave poorly and/or hang.

--------------------------------------------------------------------------

I have read the link in the warning message, and this is what I have done so far:

  1. Appended options mlx4_core log_num_mtt=20 log_mtts_per_seg=4 to /etc/modprobe.d/mlx4_en.conf
  2. Made sure the following lines are present in /etc/security/limits.conf
    • * soft memlock unlimited
    • * hard memlock unlimited
  3. Appended session required pam_limits.so to /etc/pam.d/sshd
  4. Made sure ulimit -c unlimited is uncommented in /etc/init.d/pbs_mom

Can anyone help me find out what I have missed?

1 Answer:

Answer 0: (score: 3)

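Your mlx4_core parameters allow only 2^20 * 2^4 * 4 KiB = 64 GiB to be registered. With 192 GiB of physical memory per node, and given the recommendation to have at least twice as much registerable memory as RAM, you should set log_num_mtt to 23. That raises the limit to 512 GiB, the smallest power of two that is greater than or equal to twice the amount of RAM. Be sure to reboot the nodes, or unload and then reload the kernel module.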
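The arithmetic above can be checked with a short bash sketch. The function names registerable_gib and min_log_num_mtt are made up for this illustration, and the 4 KiB page size is an assumption that holds on typical x86_64 nodes:

```shell
#!/bin/bash
# Registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size.
# Prints the limit in GiB (truncated) for the given mlx4_core parameters.
# The third argument is the page size in KiB (assumed 4 if omitted).
registerable_gib() {
    local log_num_mtt=$1 log_mtts_per_seg=$2 page_kib=${3:-4}
    # Total in KiB, then shift right by 20 because 1 GiB = 2^20 KiB.
    echo $(( ( (1 << (log_num_mtt + log_mtts_per_seg)) * page_kib ) >> 20 ))
}

# Prints the smallest log_num_mtt whose limit is at least twice the RAM (GiB).
min_log_num_mtt() {
    local ram_gib=$1 log_mtts_per_seg=$2 page_kib=${3:-4}
    local n=0
    while [ "$(registerable_gib "$n" "$log_mtts_per_seg" "$page_kib")" -lt $(( 2 * ram_gib )) ]; do
        n=$(( n + 1 ))
    done
    echo "$n"
}

registerable_gib 20 4    # current settings: prints 64
min_log_num_mtt 192 4    # 192 GiB of RAM: prints 23
registerable_gib 23 4    # limit with log_num_mtt=23: prints 512
```

With these numbers, the line in /etc/modprobe.d/mlx4_en.conf would become options mlx4_core log_num_mtt=23 log_mtts_per_seg=4. Once the module has been reloaded, the active values can usually be read back from /sys/module/mlx4_core/parameters/.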

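You should also submit a simple Torque job script that runs ulimit -l, to check the locked-memory limit that jobs actually see and to make sure that there is none. Note that ulimit -c unlimited does not remove the limit on the amount of locked memory; it removes the limit on the size of core dump files.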
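A minimal sketch of such a job script follows. The job name memlock-check, the resource request, and the helper function report_memlock are illustrative choices, not anything prescribed by Torque:

```shell
#!/bin/bash
#PBS -N memlock-check
#PBS -l nodes=1:ppn=1
#PBS -j oe

# Report the locked-memory limit inherited from pbs_mom on the execution host.
# "unlimited" is the desired value; a number is the limit in KiB.
report_memlock() {
    local memlock
    memlock=$(ulimit -l)
    echo "host: $(uname -n)"
    echo "max locked memory: ${memlock}"
    if [ "$memlock" != "unlimited" ]; then
        # Assumption: pbs_mom is started from an init script, so the limit
        # would have to be raised there (ulimit -l, not ulimit -c).
        echo "locked memory is limited; check ulimit -l where pbs_mom is started"
    fi
}

report_memlock
```

Submit it with qsub and read the job's output file. If it reports a number rather than unlimited, the limit is most likely inherited from the pbs_mom daemon, since jobs started by pbs_mom do not pass through the PAM configuration of sshd. In that case, putting ulimit -l unlimited in /etc/init.d/pbs_mom and restarting pbs_mom is the usual remedy.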