Interpreting Scala from a String within a Spark session

Time: 2017-04-25 20:51:17

Tags: scala apache-spark user-defined-functions

I am trying to interpret strings containing Scala code from inside a Spark session. Everything works fine except for anything involving user-defined functions (UDFs, map, flatMap, etc.).

There are a few references to this problem on the web, and the usual answer is to make sure the relevant spark-something-or-other dependency is marked as "provided" in sbt. That solution does not apply here: each of the examples below is run directly from spark-shell.
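
For reference, "provided" in that suggestion means an sbt dependency scope along these lines (the artifact and version here are illustrative, not taken from this question):

// build.sbt sketch: mark Spark as "provided" so it is not bundled into the application jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"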

Preamble

import scala.tools.nsc.GenericRunnerSettings
import scala.tools.nsc.interpreter.IMain

val settings = new GenericRunnerSettings( println _ )
settings.usejavacp.value = true
val interpreter = new IMain(settings, new java.io.PrintWriter(System.out))
interpreter.bind("spark", spark);
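
IMain.interpret returns a scala.tools.nsc.interpreter.Results.Result value, so one way to tell compile errors, incomplete input and successful runs apart is to match on it; a minimal sketch (not part of the original setup):

import scala.tools.nsc.interpreter.Results

interpreter.interpret("val x = 5") match {
  case Results.Success    => println("interpreted OK")
  case Results.Error      => println("compile or runtime error")
  case Results.Incomplete => println("incomplete input")
}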

These work:

// works:
interpreter.interpret("val x = 5")

// works:
interpreter.interpret("import spark.implicits._\nval df = spark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.show")

// works:
val upper: String => String = _.toUpperCase
spark.udf.register("myUpper", upper)
interpreter.interpret("import org.apache.spark.sql.functions._\nimport spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF = udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\", callUDF(\"myUpper\", ($\"value\"))).show")

These don't work:

// doesn't work, fails with seq/RDD serialization error:
interpreter.interpret("import org.apache.spark.sql.functions._\nimport spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF = udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\", upperUDF($\"value\")).show")

// doesn't work, fails with seq/RDD serialization error:
interpreter.interpret("import org.apache.spark.sql.functions._\nimport spark.implicits._\nval upper: String => String = _.toUpperCase\nspark.udf.register(\"myUpper\", upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\", callUDF(\"myUpper\", ($\"value\"))).show")

The ones that don't work fail with this exception:

Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
  at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
  at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Any help is appreciated!

0 Answers:

There are no answers yet.