Question

I configured each node in slurm.conf like this:

NodeName=node1 NodeAddr=xxx.xxx.xxx.xxx   State=UNKNOWN Procs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128000  TmpDisk=65536

When I run the following command:

srun -n 2 sleep 60

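I find that the job is allocated every core in the node, even though it only asks for two tasks. If another job wants to run on this node, it stays pending until the previous job finishes.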

scontrol shows the following job information:

JobId=51 JobName=sleep
UserId=hadoop(1002) GroupId=hadoop(1002) MCS_label=N/A
Priority=4294901703 Nice=0 Account=hadoop QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:12 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2018-07-16T21:46:56 EligibleTime=2018-07-16T21:46:56
StartTime=2018-07-16T21:46:56 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-07-16T21:46:56
Partition=TOTAL AllocNode:Sid=node1:25124
ReqNodeList=(null) ExcNodeList=(null)
NodeList=xxx.xxx.xxx
BatchHost=xxx.xxx.xxx
NumNodes=1 NumCPUs=32 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=125G,node=1,billing=32
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=125G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/home/hadoop
Power=

Using sacct to look up the job history, I get the following output:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
       51       sleep      TOTAL     hadoop         32    COMPLETED  0:0
       51.0     sleep                hadoop          2    COMPLETED  0:0

The partition information is:

  PartitionName=TOTAL
  AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
  AllocNodes=ALL Default=YES QoS=N/A
  DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 
  Hidden=NO
  MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO 
  MaxCPUsPerNode=UNLIMITED
  Nodes=xxxxxxx
  PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
  OverTimeLimit=NONE PreemptMode=OFF
  State=UP TotalCPUs=96 TotalNodes=3 SelectTypeParameters=NONE
  DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Something seems to be wrong.

Answer 1

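This problem is caused by SelectType. I had left it at its default, which I believe is select/linear. As described in the Select Plugin Design Guide, select/linear is node-centric, so it allocates whole nodes to a job: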

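The select/linear and select/cons_res plugins have similar modes of operation. The obvious difference is that the data structures in select/linear are node-centric, while those in select/cons_res contain information at a finer resolution (sockets, cores, threads, or CPUs, depending upon the SelectTypeParameters configuration parameter).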
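To confirm which select plugin a cluster is actually using, the running configuration can be inspected. The following is a minimal sketch, not part of the original answer: `select_settings` is a hypothetical helper name, and it assumes that `scontrol show config` prints one `Name = value` pair per line.

```shell
# Hypothetical helper: keep only the SelectType and SelectTypeParameters
# lines from configuration text read on standard input.
select_settings() {
  grep -E '^SelectType(Parameters)?[[:space:]]*='
}

# On a node with the Slurm client tools installed, run:
#   scontrol show config | select_settings
```

If the SelectType line reports select/linear, whole nodes are being allocated, which matches the NumCPUs=32 seen for a two-task job in the question.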

I changed SelectType to select/cons_res and restarted the whole cluster, and the problem was solved.
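For illustration, the change might look like the slurm.conf fragment below. This is a sketch under assumptions: the answer only names SelectType, so the SelectTypeParameters value shown (CR_Core) is one possible choice and not necessarily what was used here. Other documented values such as CR_CPU or CR_Core_Memory may suit a given site better.

```
# slurm.conf (fragment)
# Allocate individual consumable resources instead of whole nodes.
SelectType=select/cons_res
# Assumption: treat cores as the consumable resource. This value is not
# stated in the answer; pick the one that fits your site.
SelectTypeParameters=CR_Core
```

The same file should be present on every node, and the Slurm daemons need to be restarted for a SelectType change to take effect, as the answer did. Newer Slurm releases also provide a select/cons_tres plugin, so check the documentation for the installed version.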

Slurm: all CPUs in a node are allocated to a job that needs only a subset of the CPUs
