Flink application loses state with error "Failed to download data for state handles"

Asked: 2021-03-04 03:22:09

Tags: apache-flink

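We are running Flink 1.9 on EMR with the RocksDB state backend. The state size is 100 GB, and events are kept in state for 400 days. On February 13 we hit a runtime exception, and it looks like we lost event state.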

The stack trace is:

2021-02-13 04:54:00,868 ERROR org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder  - Caught unexpected exception.
org.apache.flink.util.FlinkRuntimeException: Failed to download data for state handles.
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForAllStateHandles(RocksDBStateDownloader.java:92)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.transferAllStateDataToDirectory(RocksDBStateDownloader.java:66)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.transferRemoteStateToLocalDirectory(RocksDBIncrementalRestoreOperation.java:224)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:189)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:253)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:881)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:395)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Retry's backoff was interrupted by other process
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForAllStateHandles(RocksDBStateDownloader.java:85)
    ... 18 more
Caused by: java.lang.RuntimeException: Retry's backoff was interrupted by other process
    at com.amazon.ws.emr.hadoop.fs.util.EmrFsUtils.sleep(EmrFsUtils.java:376)
    at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:166)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.flink.core.fs.FSDataInputStreamWrapper.read(FSDataInputStreamWrapper.java:56)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:135)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:109)
    at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:50)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
    at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
    at java.util.concurrent.CompletableFuture.asyncRunStage(CompletableFuture.java:1654)
    at java.util.concurrent.CompletableFuture.runAsync(CompletableFuture.java:1871)
    at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForAllStateHandles(RocksDBStateDownloader.java:83)
    ... 18 more

Whenever we fail to find an event in state, we log an error:

{ "ID": "36010671494153310629767457672110260480504804454328631747", "timestamp": 1614774357250, "message": "2021-03-03 12:25:57,166 ERROR io.benevity.data.processstate.TransactionAndEntryJoin - Missed transaction and transaction entry join -- did not receive transaction record, transaction id:151422198 and transaction entry id:309836502, record ts:Some(1614600053) and trigger timer ts:1614772853000 and operation type:update-transferable and transaction type:2216" }
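
To make the timing in the log record above easier to read, here is a minimal Python sketch that decodes its three timestamps. It assumes, judging only by their magnitudes, that the record ts is in Unix epoch seconds and that the trigger timer ts and the log event's own timestamp are in epoch milliseconds. The variable names and the helper `to_utc` are illustrative and not part of the job.

```python
from datetime import datetime, timezone

# Values copied from the log record above.
record_ts_s = 1614600053      # record ts, assumed to be epoch seconds
timer_ts_ms = 1614772853000   # trigger timer ts, assumed to be epoch milliseconds
logged_ts_ms = 1614774357250  # the log event's own timestamp, assumed epoch milliseconds


def to_utc(epoch_seconds: float) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


record_time = to_utc(record_ts_s)
timer_time = to_utc(timer_ts_ms / 1000)
logged_time = to_utc(logged_ts_ms / 1000)

print("record ts:        ", record_time.isoformat())
print("trigger timer ts: ", timer_time.isoformat())
print("log event written:", logged_time.isoformat())
print("timer - record:   ", timer_time - record_time)
print("log - timer:      ", logged_time - timer_time)
```

Under those assumptions, the record is dated 2021-03-01 12:00:53 UTC and the timer fired exactly 48 hours later, so both timestamps fall after the February 13 exception.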

Does anyone know what happened to the stateful operator during this exception?

0 answers:

No answers yet.