Question

My Kafka topic "data source" contains data with the following schema (simplified here for demonstration):

{ "deal": -1, "location": "", "value": -1, "type": "init" },
{ "deal": 123456, "location": "Mars", "value": 100.0, "type": "batch" },
{ "deal": 123457, "location": "Earth", "value": 200.0, "type": "batch" },
{ "deal": -1, "location": "", "value": -1, "type": "commit" }

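This data comes from a batch run in which we take all deals and recalculate their values. Think of it as a start-of-day process: at this point, here is a fresh set of data for all locations. At the moment the init and commit messages are not sent to the real topic; they are filtered out by the producer.

During the day, updates arrive as things change. These provide new data (in this example we can ignore overwriting existing data, since that would be handled by re-running the batch):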

{ "deal": 123458, "location": "Mars", "value": 150.0, "type": "update" }

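This data enters the application as a KStream named "positions".

Another topic, "locations", lists the possible locations. These are pulled into the Java kafka-streams application as a GlobalKTable named locations: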

{ "id": 1, "name": "Mars" },
{ "id": 2, "name": "Earth"}

The plan is to use a Java 9 kafka-streams application to aggregate these values, grouped by location. The output should look something like:

{ "id": 1, "location": "Mars", "sum": 250.0 },
{ "id": 2, "location": "Earth", "sum": 200.0 }

This is what I have working so far:

StreamsBuilder builder = new StreamsBuilder();

/** snip: creating serdes, setting up stores, boilerplate **/

final GlobalKTable<Integer, Location> locations = builder.globalTable(
                LOCATIONS_TOPIC, 
                /* serdes, materialized, etc */
                );

final KStream<Integer, PositionValue> positions = builder.stream(
                POSITIONS_TOPIC,
                /* serdes, materialized, etc */
            );

/* The real thing is more than just a name, so a transformer is used to match locations to position values, and filter ones that we don't care about */
KStream<Location, PositionValue> joined = positions
                .transform(() -> new LocationTransformer(), POSITION_STORE) 
                .peek((location, positionValue) -> { 
                    LOG.debugv("Processed position {0} against location {1}", positionValue, location);
                });

/** Group by location and aggregate the position values **/
joined.groupByKey(Grouped.with(locationSerde, positionValueSerde))
      .aggregate(
          Aggregation::new, /* initializer */
          (location, positionValue, aggregation) -> aggregation.updateFrom(location, positionValue), /* adder */
          Materialized.<Location, Aggregation>as(aggrStoreSupplier)
              .withKeySerde(locationSerde)
              .withValueSerde(aggregationSerde)
      );

Topology topo = builder.build();

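The problem I have is that this aggregates everything: each daily batch, plus the updates, then the next daily batch, all get added together. Basically, I need a way to say "here is the next set of batch data, reset against this". I do not know how to do this. Please help!

Thanks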

Answer 1

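So, if I understand you correctly, you want to aggregate the data, but only for the most recent day, and discard the rest.

I suggest aggregating into an intermediate class that holds all the values from the stream and also contains the logic for filtering out data from previous days. If I understood you correctly, that means discarding all data that came before the most recent run of "batch" records.

I have built a similar solution, although in Kotlin, which you can look at if it helps.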
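As a rough illustration of the intermediate-class idea, here is a minimal plain-Java sketch. It assumes that every record carries a timestamp and that "previous days" means earlier UTC calendar days; neither is stated in the question. The class name `DailyAggregation` and its methods are hypothetical and are not part of Kafka Streams or of the question's code.

```java
/** Hypothetical intermediate aggregate that keeps the sum for the newest day only. */
class DailyAggregation {
    private static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    // Assumption: a "day" is a UTC epoch day derived from the record timestamp.
    private long currentDay = Long.MIN_VALUE;
    private double sum = 0.0;

    /** Folds one value into the aggregate and returns the aggregate itself. */
    DailyAggregation updateFrom(double value, long timestampMs) {
        long day = Math.floorDiv(timestampMs, MS_PER_DAY);
        if (day > currentDay) {
            // First record of a newer day: throw away the previous day's total.
            currentDay = day;
            sum = value;
        } else if (day == currentDay) {
            // Same day (batch or update): keep adding.
            sum += value;
        }
        // day < currentDay: stale record from an earlier day, ignored.
        return this;
    }

    double sum() { return sum; }

    long currentDay() { return currentDay; }

    public static void main(String[] args) {
        long day1 = 0L;
        long day2 = MS_PER_DAY;
        DailyAggregation mars = new DailyAggregation();
        mars.updateFrom(100.0, day1 + 1_000);   // day 1 batch record
        mars.updateFrom(150.0, day1 + 50_000);  // day 1 update
        System.out.println(mars.sum());         // prints 250.0
        mars.updateFrom(120.0, day2 + 1_000);   // day 2 batch record resets the sum
        mars.updateFrom(999.0, day1 + 60_000);  // late day 1 record is dropped
        System.out.println(mars.sum());         // prints 120.0
    }
}
```

In the question's topology the adder lambda only receives the key, the value, and the current aggregate, so the timestamp (or a business date) would have to travel inside the value, for example by having the existing transformer copy it in.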

Answer 2

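There are a few things you could do, but I would recommend a time-windowed stream. You can define a tumbling window of 1 day and then perform the aggregation on that windowed stream. You end up with each day aggregated in its own window in a KTable. That way you do not have to worry about discarding data (although you can), and each day is kept separate.

There are some good examples here: https://www.programcreek.com/java-api-examples/?api=org.apache.kafka.streams.kstream.TimeWindows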
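To show what a one-day tumbling window does to the data, here is a small plain-Java sketch that does not depend on Kafka. It mimics the way Kafka Streams aligns tumbling windows to the epoch, so each record lands in exactly one window per key. The class `DailyWindowSketch` and the "location@windowStart" key format are made up for illustration. In the DSL itself this would roughly correspond to `groupByKey(...).windowedBy(TimeWindows.of(Duration.ofDays(1))).aggregate(...)`; newer releases deprecate `TimeWindows.of` in favor of `TimeWindows.ofSizeWithNoGrace`.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of one-day tumbling windows; this is not Kafka Streams code. */
class DailyWindowSketch {
    static final long WINDOW_MS = 24L * 60 * 60 * 1000;

    // One running sum per (location, window), keyed as "location@windowStart".
    private final Map<String, Double> sums = new TreeMap<>();

    /** Start of the epoch-aligned window that contains the given timestamp. */
    static long windowStart(long timestampMs) {
        return Math.floorDiv(timestampMs, WINDOW_MS) * WINDOW_MS;
    }

    /** Adds a value to the sum of the window its timestamp falls into. */
    void add(String location, double value, long timestampMs) {
        String key = location + "@" + windowStart(timestampMs);
        sums.merge(key, value, Double::sum);
    }

    /** Returns the sum for a location and window start, or 0.0 if nothing was recorded. */
    double sumFor(String location, long windowStartMs) {
        return sums.getOrDefault(location + "@" + windowStartMs, 0.0);
    }

    public static void main(String[] args) {
        DailyWindowSketch sketch = new DailyWindowSketch();
        sketch.add("Mars", 100.0, 1_000L);               // day 1 batch
        sketch.add("Earth", 200.0, 2_000L);              // day 1 batch
        sketch.add("Mars", 150.0, 50_000L);              // day 1 update
        sketch.add("Mars", 120.0, WINDOW_MS + 1_000L);   // day 2 batch
        System.out.println(sketch.sumFor("Mars", 0L));         // prints 250.0
        System.out.println(sketch.sumFor("Earth", 0L));        // prints 200.0
        System.out.println(sketch.sumFor("Mars", WINDOW_MS));  // prints 120.0
    }
}
```

Because the windows are aligned to the epoch, a one-day window runs from midnight to midnight UTC. Whether that lines up with the start of the daily batch run is an assumption that should be checked against the real schedule.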

Kafka Streams: processing batch data to reset an aggregation