org.apache.spark.SparkException: Task not serializable (caused by org.apache.hadoop.conf.Configuration)

Date: 2016-06-28 17:28:11

Tags: scala hadoop elasticsearch apache-spark

I want to write a transformed stream to an Elasticsearch index, as follows:

transformed.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    val messages = rdd.map(prepare)
    messages.saveAsNewAPIHadoopFile("-", classOf[NullWritable], classOf[MapWritable], classOf[EsOutputFormat], ec)
  }
})

The line val messages = rdd.map(prepare) raises the error shown below. I have tried different approaches to solve the problem (e.g. marking val conf as @transient), but none of them worked.

  

16/06/28 19:23:00 ERROR JobScheduler: Error running job streaming job 1467134580000 ms.0
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.map(RDD.scala:323)
    at de.kp.spark.elastic.stream.EsStream$$anonfun$run$1.apply(EsStream.scala:77)
    at de.kp.spark.elastic.stream.EsStream$$anonfun$run$1.apply(EsStream.scala:75)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
Serialization stack:
    - object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml)
    - field (class: de.kp.spark.elastic.stream.EsStream, name: de$kp$spark$elastic$stream$EsStream$$conf, type: class org.apache.hadoop.conf.Configuration)
    - object (class de.kp.spark.elastic.stream.EsStream, de.kp.spark.elastic.stream.EsStream@6b156e9a)
    - field (class: de.kp.spark.elastic.stream.EsStream$$anonfun$run$1, name: $outer, type: class de.kp.spark.elastic.stream.EsStream)
    - object (class de.kp.spark.elastic.stream.EsStream$$anonfun$run$1, )
    - field (class: de.kp.spark.elastic.stream.EsStream$$anonfun$run$1$$anonfun$2, name: $outer, type: class de.kp.spark.elastic.stream.EsStream$$anonfun$run$1)
    - object (class de.kp.spark.elastic.stream.EsStream$$anonfun$run$1$$anonfun$2, )
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    ... 30 more

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    (same stack trace and "Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration" as above)

Does it have something to do with Hadoop's Configuration? (I am referring to this message: class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml)
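The serialization stack seems to say that the closure passed to rdd.map drags in the enclosing EsStream instance, whose conf field (a Hadoop Configuration) is not Java-serializable. Below is a minimal, self-contained sketch of that capture pattern; the class and key names (Holder, some.key) are made up for illustration and are not part of my project:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext

class Holder(conf: Configuration) extends Serializable {

  // Because `conf` is referenced in a method body, the compiler keeps it
  // as a field of Holder -- and that field is a non-serializable
  // Hadoop Configuration.
  def prepare(s: String): String = s + conf.get("some.key", "")

  def run(sc: SparkContext): Unit = {
    val rdd = sc.parallelize(Seq("a", "b"))
    // `rdd.map(prepare)` captures `this`, so Spark has to serialize the
    // whole Holder instance, including the Configuration field, and fails
    // with "Task not serializable".
    rdd.map(prepare).collect()
  }
}

In my code the same chain appears to run through prepare and the conf constructor parameter of EsStream.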

UPDATE

class EsStream(name:String,conf:HConf) extends SparkBase with Serializable {

  /* Elasticsearch configuration */ 
  val ec = getEsConf(conf)               

  /* Kafka configuration */
  val (kc,topics) = getKafkaConf(conf)

  def run() {

    val ssc = createSSCLocal(name,conf)

    /*
     * The KafkaInputDStream returns a Tuple where only the second component
     * holds the respective message; we therefore reduce to a DStream[String]
     */
    val stream = KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](ssc,kc,topics,StorageLevel.MEMORY_AND_DISK).map(_._2)
    /*
     * Inline transformation of the incoming stream by any function that maps 
     * a DStream[String] onto a DStream[String]
     */
    val transformed = transform(stream)
    /*
     * Write transformed stream to Elasticsearch index
     */
    transformed.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val messages = rdd.map(prepare)
        messages.saveAsNewAPIHadoopFile("-", classOf[NullWritable], classOf[MapWritable], classOf[EsOutputFormat], ec)
      }
    })

    ssc.start()
    ssc.awaitTermination()    

  }

  def transform(stream:DStream[String]) = stream

  private def getEsConf(config:HConf):HConf = {

    val _conf = new HConf()

    _conf.set("es.nodes", conf.get("es.nodes"))
    _conf.set("es.port", conf.get("es.port"))

    _conf.set("es.resource", conf.get("es.resource"))

    _conf

  }

  private def getKafkaConf(config:HConf):(Map[String,String],Map[String,Int]) = {

    val cfg = Map(
      "group.id" -> conf.get("kafka.group"),

      "zookeeper.connect" -> conf.get("kafka.zklist"),
      "zookeeper.connection.timeout.ms" -> conf.get("kafka.timeout")

    )

    val topics = conf.get("kafka.topics").split(",").map((_,conf.get("kafka.threads").toInt)).toMap   

    (cfg,topics)

  }

  private def prepare(message:String):(Object,Object) = {

    val m = JSON.parseFull(message) match {
      case Some(map) => map.asInstanceOf[Map[String,String]]
      case None => Map.empty[String,String]
    }

    val kw = NullWritable.get

    val vw = new MapWritable
    for ((k, v) <- m) vw.put(new Text(k), new Text(v))

    (kw, vw)

  }

}

1 Answer:

Answer 0 (score: 0):

Remove conf:HConf from the class constructor of EsStream and write it as class EsStream(name:String).

Next create a method with the signature: public def init(conf:HConf):Map(String,String)

In this method you read the required configuration and update ec and (kc,topics).

After that you should call your run method.
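A rough, untested sketch of what this restructuring could look like. It assumes HConf is an alias for org.apache.hadoop.conf.Configuration, that SparkBase still provides createSSCLocal as in the question, and that the Kafka/Elasticsearch imports match the question's dependencies; because saveAsNewAPIHadoopFile expects a Hadoop Configuration, the Elasticsearch settings are kept in a plain Map and a fresh Configuration is built inside run rather than stored as a field, and run still receives the Configuration since createSSCLocal appears to need it:

import kafka.serializer.StringDecoder

import org.apache.hadoop.conf.{Configuration => HConf}
import org.apache.hadoop.io.{MapWritable, NullWritable, Text}

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

import org.elasticsearch.hadoop.mr.EsOutputFormat

import scala.util.parsing.json.JSON

class EsStream(name: String) extends SparkBase with Serializable {

  /* Plain, serializable settings instead of Hadoop Configuration fields */
  private var es = Map.empty[String, String]
  private var kc = Map.empty[String, String]
  private var topics = Map.empty[String, Int]

  /* Read everything needed from the (non-serializable) Configuration up
   * front, so EsStream itself no longer holds a reference to it */
  def init(conf: HConf): Map[String, String] = {

    es = Map(
      "es.nodes"    -> conf.get("es.nodes"),
      "es.port"     -> conf.get("es.port"),
      "es.resource" -> conf.get("es.resource"))

    kc = Map(
      "group.id"                        -> conf.get("kafka.group"),
      "zookeeper.connect"               -> conf.get("kafka.zklist"),
      "zookeeper.connection.timeout.ms" -> conf.get("kafka.timeout"))

    topics = conf.get("kafka.topics").split(",")
      .map((_, conf.get("kafka.threads").toInt)).toMap

    es
  }

  def run(conf: HConf) {

    val ssc = createSSCLocal(name, conf)

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kc, topics, StorageLevel.MEMORY_AND_DISK).map(_._2)

    val transformed = transform(stream)

    transformed.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val messages = rdd.map(prepare)
        /* Build the output Configuration locally (on the driver, per batch)
         * instead of keeping it as a field of the class */
        val ec = new HConf()
        es.foreach { case (k, v) => ec.set(k, v) }
        messages.saveAsNewAPIHadoopFile("-", classOf[NullWritable],
          classOf[MapWritable], classOf[EsOutputFormat], ec)
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }

  def transform(stream: DStream[String]) = stream

  /* prepare is unchanged from the question */
  private def prepare(message: String): (Object, Object) = {
    val m = JSON.parseFull(message) match {
      case Some(map) => map.asInstanceOf[Map[String, String]]
      case None      => Map.empty[String, String]
    }
    val vw = new MapWritable
    for ((k, v) <- m) vw.put(new Text(k), new Text(v))
    (NullWritable.get, vw)
  }
}

With this layout the closure in rdd.map(prepare) still captures the EsStream instance, but all of its remaining fields are serializable, so the "Task not serializable" error should no longer be triggered by the Configuration.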