Problem Dask worker node silently fails to join the cluster

Date: 2020-09-16 17:23:22

Tags: python ssh cluster-computing dask dask-distributed

I am running into a very strange error with Dask.distributed. I have an unmanaged cluster of 4 VMs that I am trying to use with Dask. I initialize the cluster with the SSHCluster object:

from dask.distributed import Client, SSHCluster

cluster = SSHCluster( 
                ['localhost',       # scheduler
                 'localhost',       # worker 0
                 '192.168.80.18',   # worker 1
                 '192.168.80.14',   # worker 2
                 '192.168.80.12'])  # worker 3

client = Client(cluster)
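For what it's worth, I believe the same cluster can also be brought up with explicit SSH options, in case the connection settings turn out to matter (just a sketch; the username below is a placeholder, and connect_options is passed through to asyncssh.connect):

cluster = SSHCluster(
                ['localhost',       # scheduler
                 'localhost',       # worker 0
                 '192.168.80.18',   # worker 1
                 '192.168.80.14',   # worker 2
                 '192.168.80.12'],  # worker 3
                connect_options={'username': 'myuser',   # placeholder
                                 'known_hosts': None})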

All four workers appear to start without any errors:

distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - distributed.scheduler - INFO -   Scheduler at:  tcp://192.168.80.13:8786
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.80.13:34395'
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.80.14:45773'
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.80.12:45709'
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.80.18:43979'
distributed.deploy.ssh - INFO - distributed.worker - INFO -       Start worker at:  tcp://192.168.80.14:39597
distributed.deploy.ssh - INFO - distributed.worker - INFO -       Start worker at:  tcp://192.168.80.18:34763
distributed.deploy.ssh - INFO - distributed.worker - INFO -       Start worker at:  tcp://192.168.80.12:37999
distributed.deploy.ssh - INFO - distributed.worker - INFO -       Start worker at:  tcp://192.168.80.13:33627

However, 192.168.80.18 never becomes part of the cluster. Here is what the client object reports:

Client

Scheduler: tcp://192.168.80.13:8786
Dashboard: http://192.168.80.13:8787/status

Cluster

Workers: 3
Cores: 12
Memory: 101.19 GB
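For reference, one way to ask the scheduler directly which workers it knows about is via the client object created above (a small sketch):

# List the workers the scheduler has actually registered
info = client.scheduler_info()
for addr, w in info['workers'].items():
    print(addr, w.get('name'), w.get('nthreads'))

Consistent with the report above, this would list only three workers.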

Digging into the scheduler log, we can see that the problem node is never registered:

distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:  tcp://192.168.80.13:8786
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.14:39597', name: 2, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.80.14:39597
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.12:37999', name: 3, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.80.12:37999
distributed.scheduler - INFO - Receive client connection: Client-4b17a4cc-f83f-11ea-b0ac-fa163e0984d7
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.13:33627', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.80.13:33627

Furthermore, the problem node never appears anywhere in the client logs.

I have spent a few days debugging this, to no avail, and I have done everything I can to ensure the environments on these VMs are identical. I don't understand how this node can simply exclude itself from the cluster without raising any error.
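The only other diagnostic I can think of is to start a worker on the problem node by hand and watch whether it ever reaches the scheduler, either with the dask-worker CLI (dask-worker tcp://192.168.80.13:8786) or from Python (a sketch; run directly on 192.168.80.18):

# Manually start a single worker against the existing scheduler
# to see whether it can connect at all (run on 192.168.80.18)
import asyncio
from dask.distributed import Worker

async def main():
    w = await Worker('tcp://192.168.80.13:8786')
    await w.finished()

asyncio.run(main())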

Any help would be appreciated. Thanks in advance.

0 Answers:

There are no answers yet.