Question

I am using the hello world example from here, in which each process prints its processor name, its rank in MPI_COMM_WORLD, and the size of the communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}

I run this example on Slurm in two different ways, once with srun and once with sbatch.

More precisely:

(1)

srun -N 2 -n 2 mpirun ./a.out

(2)

sbatch testsimple.job

where the file testsimple.job contains:

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2
mpirun ./a.out

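The problem is that I do not understand why the outputs differ, given that, at least as far as I can tell, the two configurations are equivalent.

The outputs are:

(1)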

Hello world from processor node1, rank 0 out of 2 processors
Hello world from processor node2, rank 1 out of 2 processors
Hello world from processor node2, rank 1 out of 2 processors
Hello world from processor node1, rank 0 out of 2 processors

(2)

Hello world from processor node1, rank 0 out of 2 processors
Hello world from processor node2, rank 1 out of 2 processors

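Output (2) is what I expected, but output (1), obtained with srun, is not. Here srun seems to simply execute mpirun on each node, and the two runs are not part of the same MPI application, so the MPI_COMM_WORLD communicator is not shared across the two nodes. With sbatch, on the other hand, it is.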

I do not think this is intended, so my only guess is that something is wrong with my understanding of Slurm or of how to use it.

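I think I need to use srun in my application because it has the low-level option --cpu_bind, which sbatch does not have. I believe I need this option to set up a heterogeneous job allocation by hand, following this guide, because the Slurm version is older than 17.11.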

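My questions are:

Do you see an obvious mistake in how I use Slurm, or in my understanding of what these two commands are supposed to do? Or do you think it may be related to the Slurm configuration (which I know nothing about, since I am not the administrator)?
If the problem is not obvious, do you have any other suggestion for handling heterogeneous jobs with sbatch?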

Thanks for reading, and for any help you can provide!

Answer 1

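Running srun -N 2 -n 2 mpirun ./a.out makes Slurm allocate two tasks on two nodes and has each task run mpirun ./a.out. Each mpirun then starts its own two-rank MPI job inside the allocation, so you end up with four processes in two independent MPI applications, which is why every line of output (1) appears twice.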
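As a quick way to see this behaviour, assuming the same two-node allocation as in the question, srun can be pointed at an ordinary non-MPI command. The sketch below uses hostname purely as an illustration; it is not part of the original post.

```shell
# srun starts one copy of the given command per task.
# With -N 2 -n 2 this should print two lines, one hostname per task
# (the order of the lines is not guaranteed).
srun -N 2 -n 2 hostname
```

Replacing hostname with mpirun ./a.out therefore starts mpirun once per task, which is consistent with the duplicated lines in output (1).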

Instead, simply run srun -N 2 -n 2 ./a.out, without mpirun. This should work provided that both Slurm and the MPI library are configured correctly.
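Since the question also asks for --cpu_bind, one possible way to combine the two commands is to keep sbatch for the allocation and use srun, rather than mpirun, as the launcher inside the job script. This is a minimal sketch, not a tested configuration: the binding type cores is only an illustrative value, and whether srun can start MPI ranks directly depends on how Slurm and the MPI library were built on the cluster.

```shell
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2

# srun inherits the allocation from sbatch and starts one MPI rank per task,
# so no mpirun wrapper is used here.
# --cpu_bind is the spelling used in the question (Slurm older than 17.11);
# "cores" is an assumed example value, adjust it to the binding you need.
srun --cpu_bind=cores ./a.out
```

If the ranks do not find each other (for example, every process may report rank 0 out of 1), srun --mpi=list shows which MPI plugin types the local Slurm installation supports.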
