Question

I have a 3-node Akka cluster with 3 actors running on each node. The cluster runs fine for about 2 hours, but after that I start getting the following warnings:

[INFO] [06/07/2018 15:08:51.923] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://ClusterSystem@192.168.2.8:2552/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FClusterSystem%40192.168.2.7%3A2552-112] No response from remote for outbound association. Handshake timed out after [15000 ms].

[WARN] [06/07/2018 15:08:51.923] [ClusterSystem-akka.remote.default-remote-dispatcher-18] [akka.tcp://ClusterSystem@192.168.2.8:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.2.7%3A2552-8] Association with remote system [akka.tcp://ClusterSystem@192.168.2.7:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://ClusterSystem@192.168.2.7:2552]] Caused by: [No response from remote for outbound association. Handshake timed out after [15000 ms].]

[WARN] [06/07/2018 16:07:06.347] [ClusterSystem-akka.actor.default-dispatcher-101] [akka.remote.PhiAccrualFailureDetector@3895fa5b] heartbeat interval is growing too large: 2839 millis

Edit: response from the Akka Cluster Management API

{
  "selfNode": "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "leader": "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "oldest": "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "unreachable": [
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2552",
      "observedBy": [
        "akka.tcp://ClusterSystem@127.0.0.1:2551",
        "akka.tcp://ClusterSystem@127.0.0.1:2560"
      ]
    }
  ],
  "members": [
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "nodeUid": "105742380",
      "status": "Up",
      "roles": [
        "Frontend",
        "dc-default"
      ]
    },
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2552",
      "nodeUid": "-150160059",
      "status": "Up",
      "roles": [
        "RuleExecutor",
        "dc-default"
      ]
    },
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2560",
      "nodeUid": "-158907672",
      "status": "Up",
      "roles": [
        "RuleExecutor",
        "dc-default"
      ]
    }
  ]
}
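To read a response like the one above more quickly, a minimal sketch of a helper that joins each `unreachable` entry with its `members` record is shown here. It assumes only the JSON shape pasted above (`unreachable[].node`, `unreachable[].observedBy`, `members[].status`, `members[].roles`); the function name `summarize_unreachable` is hypothetical, and how the response text is fetched is left out on purpose.

```python
import json


def summarize_unreachable(response_text):
    """Return one record per unreachable node, joined with its membership info.

    Assumes the JSON shape shown in the response above; missing keys are
    tolerated and reported as None or an empty list.
    """
    data = json.loads(response_text)
    # Index members by address so each unreachable entry can be looked up.
    members = {m["node"]: m for m in data.get("members", [])}
    report = []
    for entry in data.get("unreachable", []):
        node = entry["node"]
        member = members.get(node, {})
        report.append({
            "node": node,
            "status": member.get("status"),
            "roles": member.get("roles", []),
            "observed_by": entry.get("observedBy", []),
        })
    return report
```

Applied to the response above, this would report the node on port 2552 as unreachable, still in status `Up`, and observed as unreachable by both other nodes. A member that stays `Up` while unreachable is what one would expect when nothing downs it automatically.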

**Edit 1:** cluster settings and failure detector configuration

    cluster {
      jmx.multi-mbeans-in-same-jvm = on
      roles = ["Frontend"]
      seed-nodes = [
        "akka.tcp://ClusterSystem@192.168.2.9:2551"]
      auto-down-unreachable-after = off

      failure-detector {

        # FQCN of the failure detector implementation.
        # It must implement akka.remote.FailureDetector and have
        # a public constructor with a com.typesafe.config.Config and
        # akka.actor.EventStream parameter.
        implementation-class = "akka.remote.PhiAccrualFailureDetector"

        # How often keep-alive heartbeat messages should be sent to each connection.
        # heartbeat-interval = 10 s

        # Defines the failure detector threshold.
        # A low threshold is prone to generate many wrong suspicions but ensures
        # a quick detection in the event of a real crash. Conversely, a high
        # threshold generates fewer mistakes but needs more time to detect
        # actual crashes.
        threshold = 18.0

        # Number of the samples of inter-heartbeat arrival times to adaptively
        # calculate the failure timeout for connections.
        max-sample-size = 1000

        # Minimum standard deviation to use for the normal distribution in
        # AccrualFailureDetector. Too low standard deviation might result in
        # too much sensitivity for sudden, but normal, deviations in heartbeat
        # inter arrival times.
        min-std-deviation = 100 ms

        # Number of potentially lost/delayed heartbeats that will be
        # accepted before considering it to be an anomaly.
        # This margin is important to be able to survive sudden, occasional,
        # pauses in heartbeat arrivals, due to for example garbage collect or
        # network drop.
        acceptable-heartbeat-pause = 15 s

        # Number of member nodes that each member will send heartbeat messages to,
        # i.e. each node will be monitored by this number of other nodes.
        monitored-by-nr-of-members = 2

        # After the heartbeat request has been sent the first failure detection
        # will start after this period, even though no heartbeat message has
        # been received.
        expected-response-after = 10 s

      }

    }
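To get a feel for how `threshold` and `acceptable-heartbeat-pause` interact, here is a rough sketch of the phi calculation. It assumes the logistic approximation that Akka's `PhiAccrualFailureDetector` is understood to use, with the acceptable pause added to the mean heartbeat interval; verify it against the Akka source for your version. The 1000 ms mean interval is an assumption (the default `heartbeat-interval` of 1 s, since the setting is commented out above), and the standard deviation is assumed to sit at the `min-std-deviation` floor. The function names are hypothetical.

```python
import math


def phi(time_since_last_ms, mean_interval_ms, pause_ms, std_dev_ms):
    """Approximate phi after a given silence (logistic approximation, assumed)."""
    mean = mean_interval_ms + pause_ms  # acceptable pause shifts the mean
    y = (time_since_last_ms - mean) / std_dev_ms
    exponent = -y * (1.5976 + 0.070566 * y * y)
    if exponent > 700.0:
        # math.exp would overflow; the heartbeat is far from overdue.
        return 0.0
    e = math.exp(exponent)
    if time_since_last_ms > mean:
        if e == 0.0:
            # exp underflowed; the silence is far beyond anything plausible.
            return math.inf
        return -math.log10(e / (1.0 + e))
    return -math.log10(1.0 - 1.0 / (1.0 + e))


def silence_until_unreachable_ms(threshold, mean_interval_ms, pause_ms,
                                 std_dev_ms, step_ms=10, limit_ms=600_000):
    """Smallest silence (in step_ms increments) at which phi reaches threshold."""
    for t in range(0, limit_ms + 1, step_ms):
        if phi(t, mean_interval_ms, pause_ms, std_dev_ms) >= threshold:
            return t
    return None  # threshold not reached within limit_ms


# Values from the configuration above; the 1000 ms mean interval is assumed.
crossing_ms = silence_until_unreachable_ms(18.0, 1000, 15000, 100)
```

Under these assumptions, the threshold of 18 is crossed a little under 17 s after the last heartbeat, so the detector is dominated by `acceptable-heartbeat-pause` rather than by the threshold itself.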

"No response from remote for outbound association. Handshake timed out after [15000 ms]" error in an Akka cluster

0 answers: