Question

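I am running a 3-node Kafka cluster on AWS.

Kafka version: 0.10.2.1
Zookeeper version: 3.4

While running some stability tests, I noticed that messages are lost when I shut down the leader node.

Here are the steps to reproduce the issue:

Create a topic with a replication factor of 3, which should make the data available on all 3 nodes.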

~ $ docker run --rm -ti ches/kafka bin/kafka-topics.sh --zookeeper "10.2.31.10:2181,10.2.31.74:2181,10.2.31.138:2181" --create --topic stackoverflow --replication-factor 3 --partitions 20
Created topic "stackoverflow".
~ $ docker run --rm -ti ches/kafka bin/kafka-topics.sh --zookeeper "10.2.31.10:2181,10.2.31.74:2181,10.2.31.138:2181" --describe --topic stackoverflow
Topic:stackoverflow    PartitionCount:20    ReplicationFactor:3    Configs:
    Topic: stackoverflow    Partition: 0    Leader: 1    Replicas: 1,2,0    Isr: 1,2,0
    Topic: stackoverflow    Partition: 1    Leader: 2    Replicas: 2,0,1    Isr: 2,0,1
    Topic: stackoverflow    Partition: 2    Leader: 0    Replicas: 0,1,2    Isr: 0,1,2
    Topic: stackoverflow    Partition: 3    Leader: 1    Replicas: 1,0,2    Isr: 1,0,2
    Topic: stackoverflow    Partition: 4    Leader: 2    Replicas: 2,1,0    Isr: 2,1,0
    Topic: stackoverflow    Partition: 5    Leader: 0    Replicas: 0,2,1    Isr: 0,2,1
    Topic: stackoverflow    Partition: 6    Leader: 1    Replicas: 1,2,0    Isr: 1,2,0
    Topic: stackoverflow    Partition: 7    Leader: 2    Replicas: 2,0,1    Isr: 2,0,1
    Topic: stackoverflow    Partition: 8    Leader: 0    Replicas: 0,1,2    Isr: 0,1,2
    Topic: stackoverflow    Partition: 9    Leader: 1    Replicas: 1,0,2    Isr: 1,0,2
    Topic: stackoverflow    Partition: 10    Leader: 2    Replicas: 2,1,0    Isr: 2,1,0
    Topic: stackoverflow    Partition: 11    Leader: 0    Replicas: 0,2,1    Isr: 0,2,1
    Topic: stackoverflow    Partition: 12    Leader: 1    Replicas: 1,2,0    Isr: 1,2,0
    Topic: stackoverflow    Partition: 13    Leader: 2    Replicas: 2,0,1    Isr: 2,0,1
    Topic: stackoverflow    Partition: 14    Leader: 0    Replicas: 0,1,2    Isr: 0,1,2
    Topic: stackoverflow    Partition: 15    Leader: 1    Replicas: 1,0,2    Isr: 1,0,2
    Topic: stackoverflow    Partition: 16    Leader: 2    Replicas: 2,1,0    Isr: 2,1,0
    Topic: stackoverflow    Partition: 17    Leader: 0    Replicas: 0,2,1    Isr: 0,2,1
    Topic: stackoverflow    Partition: 18    Leader: 1    Replicas: 1,2,0    Isr: 1,2,0
    Topic: stackoverflow    Partition: 19    Leader: 2    Replicas: 2,0,1    Isr: 2,0,1

Start producing to the topic with the following code:

import time
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(bootstrap_servers=['10.2.31.10:9092' ,'10.2.31.74:9092' ,'10.2.31.138:9092'])

try:
    count = 0
    while True:
        producer.send('stackoverflow', b'message')  # value must be bytes unless a value_serializer is set
        producer.flush()
        count += 1
        time.sleep(1)
except KeyboardInterrupt:
    print("Sent %s messages." % count)

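At this point I kill one of the machines and wait until it rejoins the cluster.

Once it is back, I stop the producer and consume all messages from the topic.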

from kafka import KafkaConsumer

consumer = KafkaConsumer('stackoverflow',
                            bootstrap_servers=['10.2.31.10:9092' ,'10.2.31.74:9092' ,'10.2.31.138:9092'],
                            auto_offset_reset='earliest',
                            enable_auto_commit=False)
try:
    count = 0
    for message in consumer:
        count += 1
        print(message)
except KeyboardInterrupt:
    print("Received %s messages." % count)

Two of the messages that were sent are missing. The producer did not report any errors.

kafka $ python producer.py
Sent 762 messages.

kafka $ python consumer.py
Received 760 messages.

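I am new to Kafka, so I would really appreciate any ideas for debugging this further, or advice on making the cluster more resilient.

Thanks for your help!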

Answer 1

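I had exactly the same problem some time ago. While investigating, I found an interesting behavior: the flush() method returns once every message in the buffer has either been sent or its request has resulted in an error, as stated in the documentation. In other words, flush() returning does not mean that every message was delivered.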
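To illustrate the point about flush(): because it returns whether the requests succeeded or failed, a failure only becomes visible if you inspect the futures returned by send(). The following is a minimal sketch, assuming the producer behaves like kafka-python's KafkaProducer and its futures expose `succeeded()` and `exception`. The helper name `send_batch` is made up for illustration.

```python
def send_batch(producer, topic, values):
    """Send all values, flush, and report which ones were not confirmed.

    `producer` is assumed to behave like kafka-python's KafkaProducer:
    send() returns a future, and flush() blocks until every buffered
    request has either succeeded or failed.
    """
    pending = [(value, producer.send(topic, value)) for value in values]
    producer.flush()  # does not raise just because a request failed

    delivered, failed = [], []
    for value, future in pending:
        if future.succeeded():
            delivered.append(value)
        else:
            # future.exception holds the error of a failed request
            failed.append((value, future.exception))
    return delivered, failed
```

With a real producer this could be called as `send_batch(producer, 'stackoverflow', [b'message'])`. A non-empty `failed` list is exactly the kind of error that the send-then-flush loop in the question never looks at.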

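I mitigated it by:

Disabling unclean.leader.election.enable on the brokers (when unset, it defaults to true in Kafka < 0.11 and to false in Kafka >= 0.11, so on 0.10.2 you need to set it to false explicitly).
Converting the asynchronous producer (send and flush) into a synchronous one: producer.send(...).get()
Adding the parameter retries=5 to the KafkaProducer init (so that the producer retries while a broker is down instead of giving up).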
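The second and third points can be sketched together as follows. This is only an illustration under assumptions: the helper names `send_sync` and `make_producer` and the 10-second timeout are made up, and the producer is assumed to behave like kafka-python's KafkaProducer, where send() returns a future whose get() raises if delivery failed.

```python
def send_sync(producer, topic, value, timeout=10):
    """Send one message and block until the broker confirms it.

    Returns True when the send was acknowledged and False when get()
    raised, meaning the message was NOT confirmed and may need resending.
    """
    try:
        # get() waits for the result and re-raises the delivery error, if any
        producer.send(topic, value).get(timeout=timeout)
        return True
    except Exception as exc:  # with kafka-python, kafka.errors.KafkaError is the narrower choice
        print("send failed: %r" % (exc,))
        return False


def make_producer(servers):
    """Build a producer that retries failed sends (requires kafka-python)."""
    from kafka import KafkaProducer  # imported lazily so send_sync has no dependency
    return KafkaProducer(bootstrap_servers=servers, retries=5)
```

In the producer loop from the question, the send and flush pair would then become `if send_sync(producer, 'stackoverflow', b'message'): count += 1`, so that only confirmed messages are counted as sent.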

Let me know if it works for you.

Answer 2

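In the end, I think the messages were lost because the number of retries was too low. After reading some blog posts about highly available Kafka, I noticed that people recommend a really high value for the "retries" parameter.

In Python: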

import sys  # sys.maxint exists only on Python 2; use sys.maxsize on Python 3
producer = KafkaProducer(bootstrap_servers=[...], retries=sys.maxint)

I ran the test again and confirmed that no messages were lost.
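Comparing counts, as in the question, shows that something is missing but not which messages. One possible refinement of the test is to diff sequence numbers on the consumer side. The sketch below assumes that the producer sends its running counter as the payload (for example `str(count).encode()` instead of a fixed string). The function name is illustrative.

```python
def missing_sequence_numbers(sent_count, received_payloads):
    """Return the sorted sequence numbers in range(sent_count) that never arrived.

    `received_payloads` is an iterable of bytes such as b"17", for example
    the `value` attribute of each consumed record.
    """
    received = {int(payload) for payload in received_payloads}
    return sorted(set(range(sent_count)) - received)
```

For example, `missing_sequence_numbers(5, [b"0", b"1", b"3"])` returns `[2, 4]`. Duplicates caused by retries are ignored, since only the set of received numbers matters.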

Messages lost when restarting a Kafka node
