Error when running a Dataproc job

Date: 2018-12-19 08:38:31

Tags: apache-spark pyspark

I'm running the following PySpark code on a GCP Dataproc cluster:

    from pyspark import SparkConf, SparkContext

    class RewriteData(object):

        def __init__(self):
            self.conf = SparkConf()
            self.sc = SparkContext(conf=self.conf)

        def read_data(self):
            # GCS paths on Dataproc use the gs:// scheme
            return self.sc.textFile("gs://test-bucket/input_data/*")

        def run(self):
            data = self.read_data()


    if __name__ == "__main__":
        obj = RewriteData()
        obj.run()

But I'm getting the following error:

"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Handling run-time error: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Strangely, when I move the SparkContext initialization into the run method, it works. I don't understand why. Any help is appreciated.
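A possible illustration of the symptom described above, as a minimal sketch that does not use Spark at all (the class and attribute names here are hypothetical, chosen only for the analogy): per SPARK-5063, a SparkContext holds unpicklable state such as sockets and thread locks, so anything that has to be serialized for shipping to workers must not carry a reference to it. Storing such a handle as an instance attribute makes the whole object unpicklable, while creating it as a local variable inside a method keeps it out of the object's pickled state.

```python
import pickle
import threading

class HoldsHandle(object):
    """Stores an unpicklable handle (a lock stands in for SparkContext)."""
    def __init__(self):
        self.handle = threading.Lock()  # locks cannot be pickled

class LocalHandle(object):
    """Creates the handle only as a local variable inside a method."""
    def run(self):
        handle = threading.Lock()  # never stored on self, so never pickled
        return "done"

def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

print(can_pickle(HoldsHandle()))  # False: the stored lock blocks pickling
print(can_pickle(LocalHandle()))  # True: nothing unpicklable is stored
```

This is only an analogy for the serialization mechanics, not a claim about where exactly the questioner's code captures `self.sc`.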

Thanks, Manish

0 Answers:

No answers