Question

I have read the other threads on this, and I worked around the problem by using a new group ID, but I would like to understand what can cause it.

I have a topic with 16 partitions, and I set session.timeout.ms = 30000 and max.poll.interval.ms = 30000000.

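I run my program and hit Ctrl+C, so it does not shut down cleanly. After doing this about 16 times (I'm guessing), I get stuck in this rejoining problem. session.timeout.ms is the heartbeat timeout, so after 30 seconds the broker should kick my consumer and my partitions should be "released", right? Or does it only listen to my max.poll.interval.ms?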

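Edit: I still get this error intermittently, and when it happens I have to restart all of my consumers. It also happens when my consumers have been running fine and then all get stuck rejoining, with no consumers added or removed. Here is the log from when I try to connect a new consumer after the group has gotten into that state: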

https://pastebin.com/AXJeSHkp

2017-06-29 17:28:16,215 DEBUG [AbstractCoordinator] - [scheduler-1] - Sending JoinGroup ((type: JoinGroupRequest, groupId=ingestion-matching-kafka-consumer-group-dev1, sessionTimeout=30000, rebalanceTimeout=43200000, memberId=, protocolType=consumer, groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@b45e5583)) to coordinator kafka04-prod01.messagehub.services.us-south.bluemix.net:9093 (id: 2147483644 rack: null)

2017-06-29 17:37:21,261 DEBUG [NetworkClient] - [scheduler-1] - Node 2147483644 disconnected.
2017-06-29 17:37:21,263 DEBUG [ConsumerNetworkClient] - [scheduler-1] - Cancelled JOIN_GROUP request {api_key=11,api_version=1,correlation_id=19,client_id=ingestion-matching-kafka-consumer-dev1} with correlation id 19 due to node 2147483644 being disconnected

Those are the first and last messages that I think are relevant. Here are the relevant timeouts I have set:

session.timeout.ms=30000
max.poll.interval.ms=43200000    
request.timeout.ms=43205000 # the docs said to keep this higher than max.poll.interval.ms
enable.auto.commit=false

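Should I also set heartbeat.interval.ms? Is that the interval at which the consumer automatically sends heartbeats to the broker from some background thread? (I have read the docs, but for some reason I can't fully make sense of it.)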

Answer 1

If your client does not disconnect properly (crash or SIGINT), it will take session.timeout.ms (30 seconds in your case) for the server to kick it from the group. During this time, the server will still think the consumer is part of the group, so it will not do any reassignments. Once this delay is over, assigned partitions will be reassigned to other consumers (if any).

This of course does not happen if you use a new group ID. While it's tempting to use a new group every time when developing (as you don't have to wait), you lose any offsets committed by the previous group, and this might not represent the state your app will be in while running in production.

Regarding max.poll.interval.ms, it's the maximum delay allowed between 2 calls to poll() in your consumer logic. I don't think this setting is relevant to this question.
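
To avoid waiting out session.timeout.ms after Ctrl+C during development, the usual approach is to let the poll loop exit and close the consumer, since closing makes it leave the group explicitly. Here is a minimal sketch of that pattern. GroupMember is a hypothetical stand-in interface so the example runs without the Kafka client on the classpath; GracefulLoop and stopOnShutdown are illustrative names too, not part of any library.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GracefulLoop {
    /** Hypothetical stand-in for the consumer, reduced to the two calls the loop needs. */
    public interface GroupMember {
        void poll();   // fetch and process one batch
        void close();  // leave the group so partitions are released right away
    }

    private final AtomicBoolean running = new AtomicBoolean(true);
    private final GroupMember member;

    public GracefulLoop(GroupMember member) {
        this.member = member;
    }

    /** Ask the loop to exit after the current poll() returns. */
    public void stop() {
        running.set(false);
    }

    /** Poll until stopped, then always close, even if poll() throws. */
    public void run() {
        try {
            while (running.get()) {
                member.poll();
            }
        } finally {
            member.close();
        }
    }

    /** Register a JVM shutdown hook (it runs on Ctrl+C) that stops the loop and waits for close(). */
    public void stopOnShutdown(Thread loopThread) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            stop();
            try {
                loopThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}
```

You would call stopOnShutdown(Thread.currentThread()) right before run(). With the real client, a poll() that is blocked waiting for records can be interrupted from the hook with KafkaConsumer.wakeup(), which makes poll() throw a WakeupException that the loop can catch before closing.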

Answer 2

I know this is a very old question, but I ran into a similar problem, eventually understood what causes it, and wanted to share.

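When a rebalance starts, Kafka waits for every consumer in the group to call poll() and send a JoinGroup request. The rebalance timeout is equal to max.poll.interval.ms. So, for each consumer, Kafka waits until it finishes its current processing and polls again, or until the rebalance timeout expires.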

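In your case, you set max.poll.interval.ms to 12 hours. The only sensible reason for that is that you have a very long processing step. So when a rebalance starts, Kafka will wait until your processing finishes or 12 hours have passed. That is why your consumers appear to be stuck.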
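
To put numbers on that, here is a rough model of the waiting rule described above. It is only a sketch of the reasoning, not broker code: RebalanceWait and waitMs are made-up names, and it assumes every member is still alive and heartbeating, so the only question is when each one next calls poll().

```java
public class RebalanceWait {
    /**
     * Estimate how long a rebalance stays open, in milliseconds.
     *
     * @param rebalanceTimeoutMs the rebalance timeout (equal to max.poll.interval.ms)
     * @param rejoinDelaysMs     for each live member, the time until its next poll()
     */
    public static long waitMs(long rebalanceTimeoutMs, long[] rejoinDelaysMs) {
        long wait = 0;
        for (long delay : rejoinDelaysMs) {
            // A member that takes longer than the timeout stops being waited for.
            long effective = Math.min(delay, rebalanceTimeoutMs);
            // The rebalance finishes only when the slowest member is accounted for.
            wait = Math.max(wait, effective);
        }
        return wait;
    }
}
```

With the 12 hour timeout from the question (43200000 ms), one member that polls again after 500 ms and another that is busy for an hour gives waitMs(43200000L, new long[] {500L, 3600000L}) == 3600000, so the whole group sits in rejoining for an hour. If the busy member never polls again, the wait is capped at the full 12 hours.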

Consumer stuck rejoining
