Question

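I am new to the Dataflow programming model and am having some trouble with what I think should be a simple use case:

I have a pipeline that reads live data from Pub/Sub. The data contains device statuses with (simplified) a serial number and a state (UP or DOWN). A device is guaranteed to send its state at least every 5 minutes, but of course a device may send the same state multiple times.

What I am trying to achieve is a pipeline that only emits state changes for a device, so basically keeping track of a "last state per key" notion for a given key and comparing new events against it.

Is there currently a good way to do this?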

Answer 1

There is a related question at "Remove duplicates across window triggers/firings" but your question brings up some subtleties that differ. So let me address two aspects separately and refer some parts to the linked question.

1. Taking the latest input value

Your question differs here because it is not obviously outputting the result of an associative & commutative Combine operation. This is important because in Dataflow & Beam, the input is not ordered - it just carries timestamps so that we can reason about it in event time.

Over pairs of (timestamp, UP/DOWN) you can define an associative & commutative operation that just takes the maximum of the timestamp, and carries the state with it. You'll have to make an arbitrary choice in the case of two equal timestamps, but it sounds like you don't expect to encounter this situation.
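As a rough illustration of that operation, here is a minimal plain-Java sketch with no Beam dependency. All names (`LatestStatus`, `Stamped`, `latest`) are made up for this example, and the tie-break on equal timestamps (enum order) is just one arbitrary but deterministic choice:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; this is plain Java, not a Beam API.
class LatestStatus {
  enum UpDown { UP, DOWN }

  // A (timestamp, status) pair; the timestamp is event time in epoch millis.
  static final class Stamped {
    final long timestampMillis;
    final UpDown status;

    Stamped(long timestampMillis, UpDown status) {
      this.timestampMillis = timestampMillis;
      this.status = status;
    }
  }

  // Keeps the pair with the larger timestamp and carries its status along.
  // Equal timestamps are resolved by enum order, so the result does not
  // depend on argument order (commutative) or grouping (associative).
  static Stamped latest(Stamped a, Stamped b) {
    if (a.timestampMillis != b.timestampMillis) {
      return a.timestampMillis > b.timestampMillis ? a : b;
    }
    return a.status.compareTo(b.status) >= 0 ? a : b;
  }

  // Folds a non-empty list with latest(); any input order gives the same result.
  static Stamped latestOf(List<Stamped> values) {
    Stamped result = values.get(0);
    for (Stamped v : values.subList(1, values.size())) {
      result = latest(result, v);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Stamped> events = new ArrayList<>(Arrays.asList(
        new Stamped(30, UpDown.UP),
        new Stamped(10, UpDown.DOWN),
        new Stamped(20, UpDown.DOWN)));
    System.out.println(latestOf(events).status);  // prints UP
    Collections.reverse(events);
    System.out.println(latestOf(events).status);  // prints UP
  }
}
```

Because the fold gives the same answer for any arrival order, this is the kind of function that could back a Combine over unordered input.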

In order to express your desires naturally, we would need a feature whereby GroupByKey also does a secondary sort of your values per key (and window). In this case, you would sort by timestamp, but the feature is pretty general and we are aware of the use case.

That will get you as far as being able to express the "take the latest value" part of your logic.

2. Only produce output when the result has changed

This aspect corresponds directly to the linked question. Your question is different in that even having defined an associative & commutative operation, you lack a canonical identity element. In the answer there, filtering out of the identity element was key to approximating incremental output.

You could come up with schemes for encoding whether or not a change is necessary, such as expanding your accumulator type to tuples of (timestamp, CHANGE/NO_CHANGE, UP/DOWN) where there is the possibility of a monotonic transition from NO_CHANGE to CHANGE. But this only really helps if you have an identity element tagged with NO_CHANGE. And given an arbitrary choice between UP and DOWN it can only reduce data volume by half.

In your case, rather than trying to directly express "output only when the combined result has changed", I would more strongly suggest managing the state machine yourself using the stateful processing features available in Apache Beam, which will be the basis for Dataflow 2.x.

The stateful DoFn code might look something like this:

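import org.apache.beam.sdk.coders.InstantCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

// DeviceId and UpDown (with its getCoder()) are your own types.
new DoFn<KV<DeviceId, UpDown>, KV<DeviceId, UpDown>>() {

  // Per-key state: event-time timestamp of the newest element seen so far.
  @StateId("latestTimestamp")
  private final StateSpec<ValueState<Instant>> latestTimestampSpec =
      StateSpecs.value(InstantCoder.of());

  // Per-key state: the last status that was emitted downstream.
  @StateId("latestOutput")
  private final StateSpec<ValueState<UpDown>> latestOutputSpec =
      StateSpecs.value(UpDown.getCoder());

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("latestTimestamp") ValueState<Instant> latestTimestampState,
      @StateId("latestOutput") ValueState<UpDown> latestOutputState) {

    Instant latestTimestamp = latestTimestampState.read();  // null for a new key
    UpDown latestOutput = latestOutputState.read();         // null for a new key
    Instant newTimestamp = c.timestamp();  // the element's event-time timestamp
    UpDown newValue = c.element().getValue();

    // Ignore elements that arrive out of order behind one already processed.
    if (latestTimestamp != null && !newTimestamp.isAfter(latestTimestamp)) {
      return;
    }
    // Always advance the timestamp, so a late, older element cannot look like a change.
    latestTimestampState.write(newTimestamp);

    // Emit only when the status differs from the last one emitted.
    if (!newValue.equals(latestOutput)) {
      c.output(KV.of(c.element().getKey(), newValue));
      latestOutputState.write(newValue);
    }
  }
}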
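To try out this kind of per-key state machine without a Beam runner, here is a small plain-Java sketch in which two `HashMap`s stand in for the per-key state. The class and method names are hypothetical. The sketch also assumes that the remembered timestamp should advance on every newer event, even when the status is unchanged, so that a late-arriving older event is not mistaken for a change:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-key state; plain Java, not a Beam API.
class ChangeDetector {
  enum UpDown { UP, DOWN }

  // Newest event-time timestamp seen per device.
  private final Map<String, Long> latestTimestamp = new HashMap<>();
  // Last status emitted per device.
  private final Map<String, UpDown> latestOutput = new HashMap<>();

  // Returns true if this event should be emitted as a state change.
  boolean process(String deviceId, long timestampMillis, UpDown status) {
    Long seen = latestTimestamp.get(deviceId);
    if (seen != null && timestampMillis <= seen) {
      return false;  // stale: an event at or after this timestamp was already handled
    }
    latestTimestamp.put(deviceId, timestampMillis);
    if (status == latestOutput.get(deviceId)) {
      return false;  // newer event, but the status did not change
    }
    latestOutput.put(deviceId, status);
    return true;
  }

  public static void main(String[] args) {
    ChangeDetector d = new ChangeDetector();
    System.out.println(d.process("dev-1", 10, UpDown.UP));    // prints true (first status)
    System.out.println(d.process("dev-1", 20, UpDown.UP));    // prints false (repeat)
    System.out.println(d.process("dev-1", 15, UpDown.DOWN));  // prints false (arrived late)
    System.out.println(d.process("dev-1", 30, UpDown.DOWN));  // prints true (change)
    System.out.println(d.process("dev-2", 5, UpDown.DOWN));   // prints true (separate key)
  }
}
```

Note the trade-off in the third call: a late event is dropped, so a brief UP to DOWN to UP flip whose middle event arrives out of order would go unreported.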

This and the linked question are both inspirations for the example I used in this blog post on the Beam blog. So you might read up there for more details.

Detecting keyed state changes
