How do I run evaluation with tf.estimator.train_and_evaluate?

Time: 2018-09-26 11:56:32

Tags: tensorflow tensorflow-estimator

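I am using tf.estimator.train_and_evaluate(...) for distributed training, with the first worker acting as chief and the second worker running evaluation. The cluster is shown below; it has 8 workers (1 chief plus 7 regular workers) and 2 parameter servers (ps). This is the TF_CONFIG of the evaluator node: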

{
    "cluster": {
        "ps": ["100.77.4.147:61415", "100.77.14.144:52383"],
        "chief": ["100.77.14.144:49606"],
        "worker": ["100.110.22.203:28312", "100.77.4.147:32299", "100.77.4.147:4950", "100.110.22.203:22196", "100.110.22.203:39327", "100.77.14.144:32888", "100.77.4.147:26919"]
    },
    "task": {
        "index": 0,
        "type": "evaluator"
    }
}

The other workers use the same cluster section, with their task indices assigned in order starting from 0.

However, errors occur at runtime:

// The chief node reports the following errors
CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error
CreateSession failed because worker /job:worker/replica:0/task:2 returned error: Unavailable: OS Error
CreateSession failed because worker /job:worker/replica:0/task:3 returned error: Unavailable: OS Error

I then checked the other workers and found the following messages:

CreateSession still waiting for response from worker: /job:worker/replica:0/task:5
CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
...

Did I set up the cluster_spec incorrectly? Thanks.

1 Answer:

Answer 0: (score: 0)

Update:

It finally works. The evaluator should not be included in the worker list of the cluster spec. Just FYI.
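
For anyone hitting the same problem, here is a minimal sketch of how the per-node TF_CONFIG could be generated so that the evaluator's address never appears in the cluster section. The host names, the port, and the helper name make_tf_config are hypothetical and used only for illustration. They are not taken from the question's cluster.

```python
import json
import os

# Hypothetical addresses, for illustration only. The evaluator's own
# address is deliberately absent: it is not a member of the training cluster.
CLUSTER = {
    "chief": ["host0:2222"],
    "worker": ["host1:2222", "host2:2222"],
    "ps": ["host3:2222"],
}


def make_tf_config(task_type, task_index):
    """Build the TF_CONFIG JSON string for one node (hypothetical helper)."""
    if task_type != "evaluator":
        # Chief, worker and ps tasks must point at an entry of the cluster.
        addresses = CLUSTER.get(task_type, [])
        if not 0 <= task_index < len(addresses):
            raise ValueError(
                "no %s task with index %d in the cluster" % (task_type, task_index)
            )
    return json.dumps(
        {"cluster": CLUSTER, "task": {"type": task_type, "index": task_index}}
    )


# On the evaluator node, before the estimator's RunConfig is created:
os.environ["TF_CONFIG"] = make_tf_config("evaluator", 0)
```

Each training node would call the same helper with its own role, for example make_tf_config("worker", 1), so every node shares an identical cluster section and only the task section differs.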