Chronos framework keeps connecting to and disconnecting from Mesos

Date: 2016-09-15 12:43:12

Tags: docker mesos

This is my first question, so I hope I am doing this correctly.

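I have 3 Docker containers on different hosts, each running zookeeper, mesos and chronos. The Mesos slaves subscribe to the master correctly, and Chronos jobs are synchronized across all hosts.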

The problem is that the chronos framework keeps connecting and disconnecting:

I0915 12:12:11.132375    49 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
I0915 12:12:11.132647    49 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [  ]
I0915 12:12:11.133229    49 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:12:11.133322    49 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:12:11.133745    55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected
I0915 12:12:25.648849    52 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
I0915 12:12:25.649029    52 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [  ]
I0915 12:12:25.649060    52 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:12:25.649116    52 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:12:25.649433    55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected
I0915 12:13:15.146510    50 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740   
I0915 12:13:15.146759    50 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [  ]
I0915 12:13:15.146848    50 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:13:15.146939    50 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:13:15.147408    55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected
I0915 12:14:04.957185    51 master.cpp:2231] Received SUBSCRIBE call for framework 'chronos-2.4.0' at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
I0915 12:14:04.957341    51 master.cpp:2302] Subscribing framework chronos-2.4.0 with checkpointing enabled and capabilities [  ]
I0915 12:14:04.957363    51 master.cpp:2312] Framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740 already subscribed, resending acknowledgement
W0915 12:14:04.957392    51 master.hpp:1764] Master attempted to send message to disconnected framework 71c69a28-ef16-4ed1-b869-04df66f84b5d-0000 (chronos-2.4.0) at scheduler-e6ebc7bc-8edb-45e9-ad68-3fa36566b55b@10.xxx.xxx.xxx:61740
E0915 12:14:04.957844    55 process.cpp:1958] Failed to shutdown socket with fd 41: Transport endpoint is not connected 

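In this case mesos-master and the chronos framework run in the same Docker container, but I suspect that the master cannot connect to Chronos on port 61740 (which is an ephemeral port).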
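One way to test this suspicion is a plain TCP connect against the address and port the scheduler advertises. The following is a minimal sketch, assuming Python is available where the master runs; `can_connect` is a hypothetical helper name, and the host in the example call is a placeholder to be replaced with the real scheduler address.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: substitute the IP the scheduler registers with.
    print(can_connect("127.0.0.1", 61740))
```

If this returns False from the master's side while Chronos is running, the port is probably not reachable at the advertised address.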

netstat capture:

[image: netstat output]

tcpdump capture:

root@HOSTNAME:/# tcpdump -i eth0 port 61740 -v
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
12:30:41.013731 IP (tos 0x0, ttl 64, id 12013, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.29468 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0xa894), seq 1155265525, win 14600, options [mss 1460,sackOK,TS val 852942104 ecr 0,nop,wscale 6], length 0
12:30:41.013780 IP (tos 0x0, ttl 64, id 49727, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.29468: Flags [R.], cksum 0x595a (correct), seq 0, ack 1155265526, win 0, length 0
12:31:18.129849 IP (tos 0x0, ttl 64, id 64040, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.30564 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0x97fb), seq 535270461, win 14600, options [mss 1460,sackOK,TS val 852979221 ecr 0,nop,wscale 6], length 0
12:31:18.129892 IP (tos 0x0, ttl 64, id 6441, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.30564: Flags [R.], cksum 0xd9be (correct), seq 0, ack 535270462, win 0, length 0
12:31:36.451417 IP (tos 0x0, ttl 64, id 21303, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.31103 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0x10c7), seq 186377873, win 14600, options [mss 1460,sackOK,TS val 852997542 ecr 0,nop,wscale 6], length 0
12:31:36.451470 IP (tos 0x0, ttl 64, id 13169, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.31103: Flags [R.], cksum 0x9a1b (correct), seq 0, ack 186377874, win 0, length 0
12:31:41.619076 IP (tos 0x0, ttl 64, id 41997, offset 0, flags [DF], proto TCP (6), length 60)
172.xxx.xxx.xxx.31252 > HOSTNAME.61740: Flags [S], cksum 0xb989 (incorrect -> 0xfe18), seq 2176478683, win 14600, options [mss 1460,sackOK,TS val 853002710 ecr 0,nop,wscale 6], length 0
12:31:41.619119 IP (tos 0x0, ttl 64, id 13179, offset 0, flags [DF], proto TCP (6), length 40)
HOSTNAME.61740 > 172.xxx.xxx.xxx.31252: Flags [R.], cksum 0x9b9d (correct), seq 0, ack 2176478684, win 0, length 0
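In this capture every SYN sent to port 61740 is answered immediately with an RST, which usually means nothing is accepting connections on that port at the address the packet arrived on. As a rough sketch (not part of the original setup), the pairing can be extracted from saved `tcpdump -v` output like this; `refused_connections` is a hypothetical helper, and the regular expression assumes the two-line-per-packet layout shown above.

```python
import re

# Matches the detail line of a packet, e.g.
# "172.x.x.x.29468 > HOSTNAME.61740: Flags [S], ..."
PACKET = re.compile(r"^\s*(\S+)\.(\d+) > (\S+)\.(\d+): Flags \[([^\]]+)\]")


def refused_connections(lines):
    """Return (client, client_port, server, server_port) for each SYN
    that was answered by an RST coming back from the server side."""
    pending = set()  # SYNs seen so far, keyed by (src, sport, dst, dport)
    refused = []
    for line in lines:
        match = PACKET.match(line)
        if match is None:
            continue  # header lines and unrelated output are skipped
        src, sport, dst, dport, flags = match.groups()
        if flags == "S":
            pending.add((src, sport, dst, dport))
        elif "R" in flags and (dst, dport, src, sport) in pending:
            # The reset travels in the opposite direction of the SYN.
            pending.discard((dst, dport, src, sport))
            refused.append((dst, dport, src, sport))
    return refused
```

Fed the lines above, each of the four SYN packets should show up as a refused connection to port 61740.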

The IP 172.xxx.xxx.xxx is the container IP, but I am actually running mesos-master like this:

mesos-master --log_dir=/var/log/mesos/master/ --work_dir=/var/log/mesos/work/ --quorum=2 --cluster=XXXX --zk=file:///etc/mesos/zk --advertise_ip=10.XXX.XXX.XXX --hostname=HOSTNAME

Any ideas or suggestions would be appreciated.

Thanks.

1 Answer:

Answer 0: (score: 0)

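In the tcpdump capture we can see incorrect checksums. This appears to be a bug in the kernel version we run (3.10). It is fixed in 3.14+, but I could not verify that because we cannot upgrade the kernel in this environment.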

https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.w6eui9yc9
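To see quickly whether a host falls below the 3.14 threshold mentioned above, the running kernel release can be compared numerically. This is only a sketch: `kernel_version` and `older_than` are hypothetical helper names, and distribution kernels often backport fixes, so the version number by itself is not conclusive.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string like '3.10.0-327.el7'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def older_than(release, minimum=(3, 14)):
    """Return True if the release is below the given (major, minor) version."""
    return kernel_version(release) < minimum


if __name__ == "__main__":
    release = platform.release()  # kernel release of the current host
    print(release, "older than 3.14:", older_than(release))
```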