SparkException: Task not serializable when importing a PMML model

Time: 2016-02-17 14:49:23

Tags: java serialization apache-spark pmml

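I want to import a PMML model and use Spark to compute scores. Everything works when I don't use Spark, but I can't use my method inside a mapper.

The problem is that I need an Evaluator object (org.jpmml.evaluator.Evaluator), which does not appear to be Serializable. So I tried to make it Serializable with the following class: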

package util;

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import org.jpmml.evaluator.Evaluator;

public class SerializableEvaluator implements Serializable {

    private static final long serialVersionUID = 6631604036553063657L;
    private Evaluator evaluator;

    public SerializableEvaluator(Evaluator evaluator) {
        this.evaluator = evaluator;
    }

    public Evaluator getEvaluator() {
        return evaluator;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeObject(evaluator);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        // Assign to the field; a local variable would leave evaluator null after deserialization.
        this.evaluator = (Evaluator) in.readObject();
    }
}

I also made all of my classes Serializable.

Here is a sample of my code:

        logger.info("Print 5 first rows----------------------------");
        strTitanicRDD
                .take(5)
                .forEach(row -> logger.info(row));
        logger.info("Print 5 first Titanic Obs---------------------");
        strTitanicRDD
                .map(row -> new TitanicObservation(row))
                .take(5)
                .forEach(titanic -> logger.info(titanic.toString()));
        logger.info("Print 5 first Scored Titanic Obs---------------");

        try {
            strTitanicRDD
                .map(row -> new TitanicObservation(row))
                .map(
                    new Function<TitanicObservation, String>() {

                        private static final long serialVersionUID = -2968122030659306400L;

                        @Override
                        public String call(TitanicObservation titanic) throws Exception {
                            // evaluator comes from the enclosing scope, so Spark has to serialize it with this function.
                            String res = PmmlUtil.computeScoreTitanic(evaluator, titanic);
                            return res;
                        }

                    })
                .take(5)
                .forEach(row -> logger.info(row));
        // (the excerpt ends here; the closing brace and catch block were not included)

But I don't think my code will help you solve my problem, which is quite clear (see the log):

  

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
    at org.apache.spark.rdd.RDD.map(RDD.scala:286)
    at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:89)
    at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:46)
    at score.acv.AppWithSpark.main(AppWithSpark.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:577)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.xml.sax.helpers.LocatorImpl
Serialization stack:

    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
    ... 15 more

1 Answer:

Answer 0: (score: 1)

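The org.jpmml.evaluator.Evaluator interface is backed by an instance of some org.jpmml.evaluator.ModelEvaluator subclass. The ModelEvaluator class and all of its subclasses are serializable by design. The problem lies with the org.dmg.pmml.PMML object instance that you passed to the ModelEvaluatorFactory#newModelManager(PMML) method at the start.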
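To illustrate why a serializable object can still fail, here is a minimal sketch that uses only the JDK. LocatedNode is a hypothetical stand-in for a PMML class model object (it is not a JPMML class), and offendingClass is a helper invented for this example. The node itself implements Serializable, but it holds an org.xml.sax.helpers.LocatorImpl, which does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

import org.xml.sax.Locator;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical stand-in for a PMML class model object; not part of JPMML.
final class LocatedNode implements Serializable {
    private static final long serialVersionUID = 1L;

    final String name;
    // The declared type is an interface; the runtime LocatorImpl is not Serializable.
    Locator locator;

    LocatedNode(String name, Locator locator) {
        this.name = name;
        this.locator = locator;
    }
}

final class LocatorSerializationDemo {

    // Returns the class name reported by NotSerializableException, or null on success.
    static String offendingClass(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        LocatedNode withLocator = new LocatedNode("model", new LocatorImpl());
        LocatedNode withoutLocator = new LocatedNode("model", null);

        System.out.println("with locator:    " + offendingClass(withLocator));
        System.out.println("without locator: " + offendingClass(withoutLocator));
    }
}
```

Java serialization walks the whole object graph, so one non-serializable field anywhere below the evaluator is enough to fail the task. This is why wrapping the evaluator in another Serializable class does not help.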

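In short, every PMML class model object can have SAX Locator information attached to it. This is useful during development and testing for locating offending XML content. In production, however, this information should no longer be kept around. You can disable SAX Locator information by configuring the JAXB runtime appropriately, or you can clear existing SAX Locator instances by calling PMMLObject#setLocator(Locatable) with a null argument. The latter functionality is formalized by the org.jpmml.model.visitors.LocatorNullifier visitor class.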
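The clearing approach can be sketched with JDK classes alone. Everything below is illustrative: ModelNode is a hypothetical stand-in for a PMML class model object, and clearLocators only imitates the idea behind LocatorNullifier by walking the tree and setting each locator to null. It is not the JPMML implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Locator;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical stand-in for a PMML class model object; not part of JPMML.
final class ModelNode implements Serializable {
    private static final long serialVersionUID = 1L;

    final String name;
    final List<ModelNode> children = new ArrayList<>();
    Locator locator;

    ModelNode(String name, Locator locator) {
        this.name = name;
        this.locator = locator;
    }

    ModelNode add(ModelNode child) {
        children.add(child);
        return this;
    }
}

final class LocatorClearingDemo {

    // Walks the tree and drops every locator, in the spirit of a nullifying visitor.
    static void clearLocators(ModelNode node) {
        node.locator = null;
        for (ModelNode child : node.children) {
            clearLocators(child);
        }
    }

    // Serializes and deserializes a tree with plain Java serialization.
    static ModelNode roundTrip(ModelNode node) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(node);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ModelNode) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ModelNode root = new ModelNode("PMML", new LocatorImpl())
                .add(new ModelNode("TreeModel", new LocatorImpl()));

        clearLocators(root);            // without this call, roundTrip throws
        ModelNode copy = roundTrip(root);
        System.out.println(copy.name + " -> " + copy.children.get(0).name);
    }
}
```

With JPMML itself, the equivalent step is to apply the LocatorNullifier visitor to the PMML object before building the evaluator. The exact invocation depends on your JPMML version, so check it against the library you are using.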

For a complete example, see the org.jpmml.spark.EvaluatorUtil utility class in the official JPMML-Spark project (lines 73 to 75 in particular). Why not use JPMML-Spark in the first place?