Timeout error when trying to connect a dask.distributed client on a slurm-managed cluster

Time: 2017-09-04 15:16:12

Tags: dask slurm dask-distributed

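I have launched a dask-mpi cluster (using dask.distributed) across a number of cores on a slurm-managed cluster, submitting the job through slurm. All the processes seem to have started OK (the stdout in the slurm log file looks normal), but when I try to connect a client from within python using client = Client(scheduler_file='/path/to/my/scheduler.json'), I get a timeout error, as follows: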

distributed.utils - ERROR - Timed out trying to connect to 'tcp://141.142.181.102:8786' after 5 s: connect() didn't finish in time
Traceback (most recent call last):
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/distributed/comm/core.py", line 185, in connect
    quiet_exceptions=EnvironmentError)
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.gen.TimeoutError: Timeout
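
The traceback reports a 5 s connect timeout. One way to tell a slow connection from an unreachable scheduler is to retry with longer timeouts. The following is only a sketch: connect_with_backoff and its parameters are hypothetical names introduced here, and it assumes the installed distributed version accepts the timeout keyword on Client.

```python
import time


def connect_with_backoff(make_client, timeouts=(5, 15, 30), pause=0.0):
    """Call make_client(timeout) with increasing timeouts until one succeeds.

    make_client is any callable that takes a timeout in seconds and
    returns a connected client (hypothetical helper, not part of dask).
    """
    last_error = None
    for timeout in timeouts:
        try:
            return make_client(timeout)
        except Exception as exc:  # the exception type differs across versions
            last_error = exc
            print(f"connect failed with timeout={timeout} s: {exc!r}")
            time.sleep(pause)
    # Every attempt failed: re-raise the last error for inspection.
    raise last_error


# Intended use (requires dask.distributed; path as given in the question):
# from dask.distributed import Client
# client = connect_with_backoff(
#     lambda t: Client(scheduler_file='/path/to/my/scheduler.json', timeout=t))
```

If every attempt fails even with a generous timeout, the address is more likely unreachable from the client machine than merely slow to respond.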

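These are the contents of scheduler.json after launch. I don't know whether it is normal for the workers to be listed this way, or whether it indicates a problem with the setup: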

{
  "type": "Scheduler",
  "id": "Scheduler-d0f65756-1b50-43a6-a044-93e4ef047ab7",
  "address": "tcp://141.142.181.102:8786",
  "services": {
    "bokeh": 8787
  },
  "workers": {}
}
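
As a diagnostic sketch (not a fix), the file above can be parsed to list the ports the scheduler advertises, and a plain TCP connection can show whether the address is reachable from the machine running the client. The helper names scheduler_endpoints and can_connect are hypothetical, and only the Python standard library is used.

```python
import json
import socket
from urllib.parse import urlparse

# Contents of scheduler.json as posted in the question.
EXAMPLE = """
{
  "type": "Scheduler",
  "id": "Scheduler-d0f65756-1b50-43a6-a044-93e4ef047ab7",
  "address": "tcp://141.142.181.102:8786",
  "services": {"bokeh": 8787},
  "workers": {}
}
"""


def scheduler_endpoints(scheduler_json_text):
    """Return ({name: (host, port)}, number_of_workers) from scheduler.json text."""
    info = json.loads(scheduler_json_text)
    parsed = urlparse(info["address"])
    endpoints = {"scheduler": (parsed.hostname, parsed.port)}
    # Extra services (such as the bokeh dashboard) listen on the same host.
    for name, port in info.get("services", {}).items():
        endpoints[name] = (parsed.hostname, port)
    return endpoints, len(info.get("workers", {}))


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


endpoints, n_workers = scheduler_endpoints(EXAMPLE)
print(endpoints, "workers listed:", n_workers)

# To probe from the client machine (may block for up to timeout seconds per port):
# for name, (host, port) in endpoints.items():
#     print(name, host, port, can_connect(host, port))
```

If can_connect returns False from the login node but True from a compute node, that would suggest the scheduler address is not routable (or is firewalled) from where the client runs, rather than a problem inside dask itself.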

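I have run into the same problem on two different slurm-managed clusters. Does it look like I need to specify something port-specific? If so, how do I determine which ports I need to use?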

0 answers:

No answers