How can I persist externally fetched stateful data in apache-beam Python?

Time: 2019-06-05 19:38:26

Tags: apache-beam

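In my apache-beam job I call an external source, GCP Storage. For general purposes this can be thought of as an HTTP call; the important part is that it is an external call made to enrich the job.

For every piece of data I process, I call this API to fetch some information that enriches the data. Many of these calls repeatedly request the same data.

Is there a good way to cache or store the results so they can be reused for each element processed, limiting the network traffic required? This is a huge bottleneck in processing.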

2 answers:

Answer 0: (score: 0)

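Beam has no built-in persistence layer. You have to download the data you want to process, and this may happen on every worker in the fleet that needs access to the data.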

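However, you may want to consider accessing the data as a side input. You would have to preload all of the data, but would then not need to query the external source for each element: https://beam.apache.org/documentation/programming-guide/#side-inputs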
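As a rough illustration of the side-input approach, here is a minimal sketch. The record shape (a dict with an "id" field), the helper enrich(), and the in-memory reference data are all assumptions made for the example; in a real job the reference PCollection would be read from the external source once instead of being created inline.

```python
def enrich(element, lookup):
    """Return a copy of element with the looked-up value for element["id"].

    `lookup` is a plain dict; inside a pipeline Beam supplies it from the
    side input. Missing keys yield None rather than raising.
    """
    return {**element, "info": lookup.get(element["id"])}


def run():
    # Imported here so that enrich() can be exercised without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        # Hypothetical reference data as (key, value) pairs.
        reference = pipeline | "Reference" >> beam.Create(
            [("a", "alpha"), ("b", "beta")]
        )
        # Hypothetical main input with repeated keys.
        records = pipeline | "Records" >> beam.Create(
            [{"id": "a"}, {"id": "b"}, {"id": "a"}]
        )
        (
            records
            # AsDict materializes the reference PCollection as a dict that
            # is passed to enrich() as the `lookup` keyword argument.
            | "Enrich" >> beam.Map(enrich, lookup=beam.pvalue.AsDict(reference))
            | "Print" >> beam.Map(print)
        )
```

Calling run() executes the pipeline with the default runner. This only works when the whole lookup table is small enough to fit in each worker's memory.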

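For GCS, you may want to try using an existing IO such as TextIO (the link below is the Java implementation; the Python SDK counterpart is beam.io.ReadFromText): https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java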

Answer 1: (score: 0)

You could consider persisting the value as instance state on the DoFn. For example:

import apache_beam as beam

# some_api_call() is a placeholder for the external lookup.

class MyDoFn(beam.DoFn):
    def __init__(self):
        # This will be called during construction and pickled to the workers.
        self.value1 = some_api_call()

    def setup(self):
        # This will be called once for each DoFn instance (generally
        # once per worker), good for non-pickleable stuff that won't change.
        self.value2 = some_api_call()
        # Create the cache here so it exists before process() runs. A plain
        # dict is unbounded; use a bounded LRU structure if keys are many.
        self.some_lru_cache = {}

    def start_bundle(self):
        # This will be called per-bundle, possibly many times on a worker.
        self.value3 = some_api_call()

    def process(self, element):
        # This is called on each element.
        key = ...
        if key not in self.some_lru_cache:
            self.some_lru_cache[key] = some_api_call()
        value4 = self.some_lru_cache[key]
        # Use self.value1, self.value2, self.value3 and/or value4 here.
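The some_lru_cache attribute above is not defined in the answer. One possible way to fill it in is a small dict-like LRU cache built on the standard library, sketched below. The class name LRUCache and the max_size default are assumptions for illustration; it supports the same `in`, `[]` read and `[]` write operations that process() uses, so it could be created in setup().

```python
from collections import OrderedDict


class LRUCache:
    """Dict-like cache that evicts the least recently used key when full."""

    def __init__(self, max_size=1000):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._items = OrderedDict()

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, key):
        value = self._items[key]  # Raises KeyError if the key is absent.
        self._items.move_to_end(key)  # Mark as most recently used.
        return value

    def __setitem__(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            # Drop the entry at the front, i.e. the least recently used one.
            self._items.popitem(last=False)
```

Each DoFn instance would hold its own cache, so the same key may still be fetched once per worker. The cache is also lost when a worker restarts, which is usually acceptable for read-only enrichment data.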