How do I register classes with the Kryo serializer in Apache Spark?

Asked: 2016-06-04 19:52:34

Tags: serialization apache-spark pyspark kryo

I am using Spark 1.6.1 with Python. How can I enable Kryo serialization when working with PySpark?

I have the following settings in my spark-defaults.conf file:

spark.eventLog.enabled             true
spark.eventLog.dir                 //local_drive/sparkLogs
spark.default.parallelism          8
spark.locality.wait.node           5s
spark.executor.extraJavaOptions    -XX:+UseCompressedOops
spark.serializer                   org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister      Timing, Join, Select, Predicate, Timeliness, Project, Query2, ScanSelect
spark.shuffle.compress             true

And I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)

Caused by: java.lang.ClassNotFoundException: Timing
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:120)

The main class (Query2.py) contains:

from Timing import Timing
from Predicate import Predicate
from Join import Join 
from ScanSelect import ScanSelect 
from Select import Select
from Timeliness import Timeliness
from Project import Project

import sys
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setMaster(master).setAppName(sys.argv[1]).setSparkHome("$SPARK_HOME")
sc = SparkContext(conf=conf)
conf.set("spark.kryo.registrationRequired", "true")
sqlContext = SQLContext(sc)

I know that "Kryo won't make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try to set the spark.serializer setting and not try to register any classes" (Matei Zaharia, 2014). However, I need to register these classes.
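For reference, a minimal sketch of what that suggestion would look like in practice, assuming the SparkConf is built before the SparkContext and that the class list is removed from spark-defaults.conf (the app name here is only illustrative):

from pyspark import SparkConf, SparkContext

# Sketch: enable Kryo on the JVM side only, without registering any classes.
# Assumes spark.kryo.classesToRegister has been removed from spark-defaults.conf.
conf = (SparkConf()
        .setAppName("Query2")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrationRequired", "false"))

sc = SparkContext(conf=conf)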

Thanks in advance.

1 Answer:

Answer 0 (score: 7):

This is not possible. Kryo is a Java (JVM) serialization framework. It cannot be used with Python classes. To serialize Python objects, PySpark uses Python serialization tools, namely the standard pickle module and an improved version, cloudpickle. You can find additional information about PySpark serialization in Tips for properly using large broadcast variables?.
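To illustrate, a hedged sketch of the Python-side serialization path: PySpark lets you pass one of its own serializers (from pyspark.serializers) to the SparkContext, and Python objects in RDDs are pickled by the Python workers regardless of the JVM-side spark.serializer setting. The app name and sample data below are assumptions for the example.

from pyspark import SparkConf, SparkContext
from pyspark.serializers import PickleSerializer  # standard pickle-based serializer

conf = SparkConf().setAppName("pickle-demo")
# Python objects in RDDs are serialized with pickle by the Python workers;
# Kryo never sees them.
sc = SparkContext(conf=conf, serializer=PickleSerializer())

rdd = sc.parallelize([{"a": 1}, {"a": 2}])  # dicts are pickled, not Kryo-serialized
print(rdd.map(lambda d: d["a"]).sum())
sc.stop()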

Although you can enable Kryo serialization when using PySpark, it will not affect how Python objects are serialized. It is only used for the serialization of Java or Scala objects.
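As an illustration of that distinction, a sketch in which the registered names are fully qualified JVM classes that exist on the classpath, so Kryo can resolve them; plain Python class names such as "Timing" cannot. The particular classes chosen here are arbitrary examples.

from pyspark import SparkConf, SparkContext

# Registering JVM classes (fully qualified names on the classpath) works;
# Python classes defined in your .py modules do not exist on the JVM side.
conf = (SparkConf()
        .setAppName("kryo-jvm-classes")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.classesToRegister",
             "org.apache.spark.sql.types.StructType,"
             "org.apache.spark.sql.types.StructField"))

sc = SparkContext(conf=conf)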