Data loss in the Kafka producer when a Kafka broker crashes and comes back

Time: 2019-06-13 05:01:50

Tags: apache-kafka kafka-producer-api

I am seeing some data loss whenever a Kafka broker crashes and rejoins the cluster. My guess is that as soon as the broker rejoins, a rebalance is triggered, and that is when I see errors in my Kafka producer.

The producer writes to a Kafka topic with 40 partitions; the following is the sequence of logs seen whenever a rebalance is triggered.

[WARN ] 2019-06-05 20:39:08 WARN  Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133054 on topic-partition test_ve-17, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION
...
...
[WARN ] 2019-06-05 20:39:31 WARN  Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133082 on topic-partition test_ve-12, retrying (1 attempts left). Error: NOT_ENOUGH_REPLICAS
...
...
[ERROR] 2019-06-05 20:39:43 ERROR GlobalsKafkaProducer:297 - org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
...
...
[WARN ] 2019-06-05 20:39:48 WARN  Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133094 on topic-partition test_ve-22, retrying (1 attempts left). Error: NOT_ENOUGH_REPLICAS
[ERROR] 2019-06-05 20:39:53 ERROR Sender:604 - [Producer clientId=producer-1] The broker returned org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number for topic-partition test_ve-37 at offset -1. This indicates data loss on the broker, and should be investigated.
[INFO ] 2019-06-05 20:39:53 INFO  TransactionManager:372 - [Producer clientId=producer-1] ProducerId set to -1 with epoch -1
[ERROR] 2019-06-05 20:39:53 ERROR GlobalsKafkaProducer:297 - org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number
...
...
[ERROR] 2019-06-05 20:39:53 ERROR GlobalsKafkaProducer:297 - org.apache.kafka.common.errors.OutOfOrderSequenceException: Attempted to retry sending a batch but the producer id changed from 417002 to 418001 in the mean time. This batch will be dropped.

Some of the Kafka configurations we have are:

acks=all
min.insync.replicas=2
unclean.leader.election.enable=false
linger.ms=250
retries=3

I call flush() after every 3000 records produced. Am I doing something wrong here? Any pointers?
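For context, the producer-side settings from the question can be wired up roughly as follows. This is only a sketch: the bootstrap address is a placeholder, and the snippet only builds the configuration object without connecting to a cluster.

```java
import java.util.Properties;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address -- replace with your actual brokers.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Settings from the question:
        props.setProperty("acks", "all");
        props.setProperty("linger.ms", "250");
        props.setProperty("retries", "3");
        // Note: min.insync.replicas and unclean.leader.election.enable
        // are broker/topic-side settings, not producer settings.
        System.out.println("acks=" + props.getProperty("acks"));
        // A producer would then be created with
        //   new KafkaProducer<String, String>(props)
        // and flush() called after every 3000 records, as in the question.
    }
}
```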

1 answer:

Answer 0 (score: 1)

Let me assume a few things: you have 3 Kafka broker nodes, the replication factor of all topics is also 3, and you do not create topics on the fly.

Given your settings:

acks = all
min.insync.replicas=2
unclean.leader.election.enable=false

In this scenario, if both in-sync replicas go down, data loss is certain: because unclean.leader.election.enable=false, the last remaining (out-of-sync) replica is not eligible to be elected leader, so there is no leader left to accept produce requests. Since you set linger.ms=250, if one of the in-sync replicas comes back alive within that short window and is elected leader again, data loss can be avoided. Note, however, that linger.ms works together with batch.size: if you set a very low batch.size and the records waiting to be sent fill a batch, the producer may send immediately without waiting for the full linger.ms interval.
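A rough back-of-the-envelope illustration of the linger.ms / batch.size interaction, assuming ~1 KB records and the client's default batch.size of 16384 bytes (both numbers are assumptions, not from the question):

```java
public class LingerVsBatch {
    public static void main(String[] args) {
        int batchSizeBytes  = 16_384; // default producer batch.size
        int recordSizeBytes = 1_024;  // assumed average record size

        // A batch is sent as soon as EITHER linger.ms elapses OR the
        // batch fills up -- whichever comes first.
        int recordsPerBatch = batchSizeBytes / recordSizeBytes;
        System.out.println("Batch fills after " + recordsPerBatch + " records");
        // At a high enough produce rate, the batch fills long before
        // linger.ms=250 expires, so linger.ms barely delays sends.
    }
}
```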

So one clear change I would suggest is to increase retries. Also check the request.timeout.ms parameter in your configuration, and measure how long, on average, your brokers take to come back after going down. If there is a correlation, your retries (each attempt bounded by the request timeout) should cover the time the broker takes to come back alive. Combined with the other trade-offs you are already making, this will definitely help you reduce the chance of data loss.
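The sizing advice above can be sketched as a simple calculation. The recovery time, and the idea of deriving retries from it, are illustrative assumptions; the timeout and backoff values shown are the Java client defaults.

```java
import java.util.Properties;

public class RetryTuning {
    public static void main(String[] args) {
        // Hypothetical: measure your own brokers' typical recovery time.
        int brokerRecoveryMs = 120_000; // e.g. broker takes ~2 min to return
        int requestTimeoutMs = 30_000;  // client default request.timeout.ms
        int retryBackoffMs   = 100;     // client default retry.backoff.ms

        // Heuristic: choose retries so the total retry window
        // (attempts * (request timeout + backoff)) covers recovery.
        int retries = (int) Math.ceil((double) brokerRecoveryMs
                                      / (requestTimeoutMs + retryBackoffMs));

        Properties props = new Properties();
        props.setProperty("retries", Integer.toString(retries));
        props.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        System.out.println("retries=" + props.getProperty("retries"));
    }
}
```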