Zookeeper session lost without any event being fired, with no swap enabled

Time: 2019-04-02 14:41:44

Tags: apache-zookeeper apache-curator

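Our process runs on a Linux system with almost a terabyte of RAM and no swap enabled.

What happens is that our process freezes for a while, for a reason I cannot figure out, so Zookeeper expires our session as expected. The process then comes back to life, but the logs show no event being fired.

We have run into a similar situation before, but in that case the connection-loss and session-expired events were fired once our process came back, so we could handle it by recreating the process's associated ephemeral nodes on Zookeeper. We believed that was caused by a full GC cycle.

What is new now is that the process freezes, but no event is fired after it resumes! As a result, there is no way to detect that our session has expired.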
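One way to narrow this down, independent of Zookeeper events, is to have the process measure its own stalls. The following is only a minimal sketch in plain Java, not part of Zookeeper or Curator: the class name `PauseDetector`, the 100 ms sampling interval, and the 5000 ms session timeout are all assumptions chosen for illustration. A thread ticks at a fixed interval and reports how late each tick arrives; a tick that is late by more than the session timeout suggests the session has probably expired, whatever caused the freeze (GC, a stopped process, a hypervisor pause, and so on).

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: reports how late a periodic tick arrived. */
final class PauseDetector {
    private final long expectedIntervalNanos;
    private long lastTickNanos;

    PauseDetector(long expectedIntervalMillis, long startNanos) {
        this.expectedIntervalNanos = TimeUnit.MILLISECONDS.toNanos(expectedIntervalMillis);
        this.lastTickNanos = startNanos;
    }

    /** Records a tick and returns how many ms it was late (0 if on time or early). */
    long tick(long nowNanos) {
        long elapsed = nowNanos - lastTickNanos;
        lastTickNanos = nowNanos;
        long overshoot = elapsed - expectedIntervalNanos;
        return overshoot > 0 ? TimeUnit.NANOSECONDS.toMillis(overshoot) : 0;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 100;         // assumed sampling interval
        long sessionTimeoutMs = 5_000; // assumed; use the negotiated session timeout
        PauseDetector detector = new PauseDetector(intervalMs, System.nanoTime());
        for (int i = 0; i < 50; i++) { // bounded loop so the example terminates
            Thread.sleep(intervalMs);
            long stallMs = detector.tick(System.nanoTime());
            if (stallMs >= sessionTimeoutMs) {
                // A real application would verify or recreate its ephemeral nodes here.
                System.out.println("Stalled " + stallMs + " ms: session has probably expired");
            }
        }
    }
}
```

Logging every stall above a much smaller threshold, together with a timestamp, would also make it possible to correlate freezes with GC logs or system activity.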

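I am thinking of simply watching for the deletion of our ephemeral node and recreating it when that happens. But I wonder whether this is the right choice, since I still don't know why the process froze in the first place.

Increasing the session timeout is not an option, because it is already too high for us. And we are trying to handle session timeouts anyway.

So my questions are simple:

  1. Besides a full GC cycle, what else could cause this?
  2. Why are no disconnect or session-expired events fired once our process comes back online?
  3. Is it common practice to watch for the deletion of an application's ephemeral nodes rather than rely on events?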

EDIT: After increasing Zookeeper's logging verbosity, I found something very interesting

DEBUG: [07:05:57] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:06:31] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:07:04] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:07:37] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:08:11] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:08:44] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:09:17] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:09:51] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:10:24] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:10:57] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:11:31] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:12:04] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:12:38] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:13:11] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:13:44] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:14:18] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]

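If you look closely, the time difference between consecutive log lines is about 33 seconds. On my own machine, the same log message appears roughly every 1 second. Could this be caused by network latency?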
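The spacing can be checked mechanically rather than by eye. The sketch below is plain Java with a hypothetical class name, `PingGaps`; it extracts the `[HH:mm:ss]` timestamp from each "Got ping response" line and computes the gap to the previous one. It assumes all lines fall within the same day. Note also that each log line reports `after 0ms`, which appears to be the round-trip time of the ping itself, so the gap between lines and the latency of a single ping are two different measurements.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: measures the spacing between ping-response log lines. */
final class PingGaps {
    // Matches the first bracketed HH:mm:ss timestamp on a line, e.g. [07:05:57].
    private static final Pattern TIMESTAMP = Pattern.compile("\\[(\\d{2}:\\d{2}:\\d{2})\\]");

    /** Returns the gaps, in seconds, between consecutive "Got ping response" lines. */
    static List<Long> gapsSeconds(List<String> lines) {
        List<Long> gaps = new ArrayList<>();
        LocalTime previous = null;
        for (String line : lines) {
            if (!line.contains("Got ping response")) continue;
            Matcher m = TIMESTAMP.matcher(line);
            if (!m.find()) continue;
            LocalTime current = LocalTime.parse(m.group(1));
            if (previous != null) {
                gaps.add(Duration.between(previous, current).getSeconds());
            }
            previous = current; // assumes no wrap past midnight
        }
        return gaps;
    }

    public static void main(String[] args) {
        // The first four lines of the log above, shortened after the message text.
        List<String> lines = Arrays.asList(
            "DEBUG: [07:05:57] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms",
            "DEBUG: [07:06:31] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms",
            "DEBUG: [07:07:04] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms",
            "DEBUG: [07:07:37] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms");
        System.out.println(gapsSeconds(lines)); // prints [34, 33, 33]
    }
}
```

Running the same function over the client log from the other machine would show whether the cadence there really is about 1 second, which makes the two setups easier to compare.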

EDIT

Running the mntr command returned the following stats
zk_version    3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 04:05 GMT
zk_avg_latency    0
zk_max_latency    17657
zk_min_latency    0
zk_packets_received    1427134
zk_packets_sent    1596974
zk_num_alive_connections    64
zk_outstanding_requests    0
zk_server_state    follower
zk_znode_count    1394
zk_watch_count    592
zk_ephemerals_count    192
zk_approximate_data_size    181257
zk_open_file_descriptor_count    94
zk_max_file_descriptor_count    1048576
zk_fsync_threshold_exceed_count    1

I find the zk_max_latency value to be very high. What kind of latency does it measure, and how can I debug what is causing it?
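To track this value over time instead of reading it by hand, the `mntr` output can be parsed into key/value pairs. The sketch below is plain Java; the class name `MntrStats` and the 1000 ms threshold are assumptions made for illustration, and the values are treated as milliseconds, which is my understanding of how the server reports request latency. Since a maximum only ever grows until the statistics are reset (for example on server restart), sampling it periodically and noting when it jumps should help tie the spike to a point in time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses the text printed by the mntr command. */
final class MntrStats {
    /** Splits each "key<whitespace>value" line into an entry, keeping the original order. */
    static Map<String, String> parse(String output) {
        Map<String, String> stats = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            // Limit 2 keeps values that contain spaces (such as zk_version) intact.
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length == 2) {
                stats.put(parts[0], parts[1].trim());
            }
        }
        return stats;
    }

    /** True if zk_max_latency is present and above the given threshold. */
    static boolean maxLatencyExceeds(Map<String, String> stats, long thresholdMs) {
        String value = stats.get("zk_max_latency");
        return value != null && Long.parseLong(value) > thresholdMs;
    }

    public static void main(String[] args) {
        String sample = "zk_avg_latency\t0\nzk_max_latency\t17657\nzk_min_latency\t0\n";
        Map<String, String> stats = parse(sample);
        long thresholdMs = 1_000; // assumed alert threshold
        if (maxLatencyExceeds(stats, thresholdMs)) {
            System.out.println("zk_max_latency=" + stats.get("zk_max_latency")
                + " exceeds " + thresholdMs);
        }
    }
}
```

In the stats above, zk_fsync_threshold_exceed_count is 1, so comparing the time of the latency jump against the server log and disk activity on that host may be a reasonable first check.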
