PySpark Streaming "PicklingError: Could not serialize object" with checkpointing and transform

时间:2019-05-23 03:12:15

标签: pyspark spark-streaming

In PySpark Streaming, an error is raised when checkpointing is enabled and a transform performs a join with another RDD.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName='xxxx')
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("hdfs://xxxx/test")

kafka_bootstrap_servers = "xxxx"
topics = ['xxxx', 'xxxx']

doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds = KafkaUtils.createDirectStream(ssc, topics, kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})

line = kvds.map(lambda x: (1, 2))

line.transform(lambda rdd: rdd.join(doc_info)).pprint(10)

ssc.start()
ssc.awaitTermination()

Error details:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

Similar code works fine in Scala. If we remove either

ssc.checkpoint("hdfs://xxxx/test")

or

line.transform(lambda rdd: rdd.join(doc_info))

there is no error either.
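A possible workaround (a sketch only, assuming doc_info is small and static): collect the lookup data to the driver as plain Python data and broadcast it, then do a map-side join inside transform. Checkpointing pickles the DStream graph, and the lambda passed to transform captures the doc_info RDD, which cannot be pickled (SPARK-5063); a broadcast variable of a plain dict can.

# Sketch of a workaround: broadcast the lookup data as a plain dict
# instead of capturing the doc_info RDD in the transform closure.
doc_info_b = sc.broadcast(dict(((1, 2), (4, 5), (7, 8), (10, 11))))

def join_with_doc_info(rdd):
    # Mimic an inner join: keep only keys present in the broadcast table,
    # then attach the looked-up value, matching rdd.join() output shape.
    return (rdd.filter(lambda kv: kv[0] in doc_info_b.value)
               .map(lambda kv: (kv[0], (kv[1], doc_info_b.value[kv[0]]))))

line.transform(join_with_doc_info).pprint(10)

Note that broadcast variables have their own caveats when an application recovers from a checkpoint, so this sketch addresses the pickling error rather than checkpoint recovery in general.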

0 Answers:

No answers yet