Why does LogisticRegression fail with "IllegalArgumentException: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7"?

Time: 2017-06-30 10:55:13

Tags: java apache-spark apache-spark-sql apache-spark-mllib

I am trying to run a simple logistic regression program in Spark, and I get the error below. I tried including various libraries to solve the problem, but that did not help.

java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually DoubleType.

Here is my dataset CSV:

abc,pmi,sv,h,rh,label
0,4.267034,5,1.618187,5.213683,T
0,4.533071,24,3.540976,5.010458,F
0,6.357766,7,0.440152,5.592032,T
0,4.694365,1,0,6.953864,T
0,3.099447,2,0.994779,7.219463,F
0,1.482493,20,3.221419,7.219463,T
0,4.886681,4,0.919705,5.213683,F
0,1.515939,20,3.92588,6.329699,T
0,2.756057,9,2.841345,6.727063,T
0,3.341671,13,3.022361,5.601656,F
0,4.509981,7,1.538982,6.716471,T
0,4.039118,17,3.206316,6.392757,F
0,3.862023,16,3.268327,4.080564,F
0,5.026574,1,0,6.254859,T
0,3.186627,19,1.880978,8.466048,T
1,6.036507,8,1.376031,4.080564,F
1,5.026574,1,0,6.254859,T
1,-0.936022,23,2.78176,5.601656,F
1,6.435599,3,1.298795,3.408575,T
1,4.769222,3,1.251629,7.201824,F
1,3.190702,20,3.294354,6.716471,F

Here is the edited code:

import java.io.IOException;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.feature.VectorAssembler;

public class Sp_LogistcRegression {
    public void trainLogisticregression(String path, String model_path) throws IOException {
        // SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
        // JavaSparkContext sc = new JavaSparkContext(conf);
        SparkSession spark = SparkSession.builder()
                .appName("Sp_LogistcRegression")
                .master("local[6]")
                .config("spark.driver.memory", "3G")
                .getOrCreate();

        Dataset<Row> training = spark
                .read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path);

        String[] myStrings = {"abc", "pmi", "sv", "h", "rh", "label"};

        VectorAssembler VA = new VectorAssembler().setInputCols(myStrings).setOutputCol("label");
        Dataset<Row> transform = VA.transform(training);

        LogisticRegression lr = new LogisticRegression().setMaxIter(1000).setRegParam(0.3);

        LogisticRegressionModel lrModel = lr.fit(transform);
        lrModel.save(model_path);

        spark.close();
    }
}

Here is the test:

import java.io.File;
import java.io.IOException;

import org.junit.Test;

public class Sp_LogistcRegressionTest {
    Sp_LogistcRegression spl = new Sp_LogistcRegression();


    @Test
    public void test() throws IOException {

        String filename = "datas/seg-large.csv";
        ClassLoader classLoader = getClass().getClassLoader();
        File file1 = new File(classLoader.getResource(filename).getFile());
        spl.trainLogisticregression(file1.getAbsolutePath(), "/tmp");

    }    
}

UPDATE: Following your suggestion, I removed the string-valued attribute, i.e. label, from the dataset. Now I get the following error:

java.lang.IllegalArgumentException: Field "features" does not exist.
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
    at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
    at scala.collection.AbstractMap.getOrElse(Map.scala:58)
    at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)

1 Answer:

Answer 0 (score: 1)

TL;DR Use the VectorAssembler transformer.

Spark MLlib's LogisticRegression requires the features column to be of VectorUDT type (as the error message says).

In your Spark application, you read the dataset from a CSV file, and the field you use for the features is of a different type.

Note that I can only explain how to use Spark MLlib here, not necessarily what machine learning as a field of study would recommend in this case.

My recommendation is to use a transformer that maps the columns to match the requirements of LogisticRegression.

A quick look at the known transformers in Spark MLlib 2.1.1 gives me VectorAssembler:

A feature transformer that merges multiple columns into a vector column.

That is exactly what you need.

(I use Scala; rewriting the code to Java is left as a home exercise for you.)

val training: DataFrame = ...

// the following are to show that we're on the same page
val lr = new LogisticRegression().setFeaturesCol("pmi")
scala> lr.fit(training)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually IntegerType.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
  at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
  at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
  at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
  at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
  at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
  at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
  at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
  ... 48 elided

"Houston, we've got a problem." Let's fix it using VectorAssembler first.

import org.apache.spark.ml.feature.VectorAssembler
val vecAssembler = new VectorAssembler().
  setInputCols(Array("pmi")).
  setOutputCol("features")
val features = vecAssembler.transform(training)
scala> features.show
+---+--------+
|pmi|features|
+---+--------+
|  5|   [5.0]|
| 24|  [24.0]|
+---+--------+

scala> features.printSchema
root
 |-- pmi: integer (nullable = true)
 |-- features: vector (nullable = true)

Whoohoo! We have a features column of vector type! Are we done?

Yes. In my case, however, as I was experimenting with spark-shell, it would not work immediately, because lr still uses the wrong pmi column (i.e. of incorrect type).

scala> lr.fit(features)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually IntegerType.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
  at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
  at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
  at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
  at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
  at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
  at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
  at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
  ... 48 elided

Let's fix lr to use the features column.

Note that the features column is the default, so I simply create a new instance of LogisticRegression (I could also use setFeaturesCol).

val lr = new LogisticRegression()

// it works but I've got no label column (with 0s and 1s and hence the issue)
// the main issue was fixed though, wasn't it?
scala> lr.fit(features)
java.lang.IllegalArgumentException: Field "label" does not exist.
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
  at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
  at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
  at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
  at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
  at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
  at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
  at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
  at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
  at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
  at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
  ... 48 elided

Using multiple columns

After the first update to the question, another issue came up.

scala> va.transform(training)
java.lang.IllegalArgumentException: Data type StringType is not supported.
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
  ... 48 elided

The reason is that VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. That means one of the columns used with VectorAssembler is of StringType.

In your case, that column is label, which is a StringType. Have a look at the schema:

scala> training.printSchema
root
 |-- bc: integer (nullable = true)
 |-- pmi: double (nullable = true)
 |-- sv: integer (nullable = true)
 |-- h: double (nullable = true)
 |-- rh: double (nullable = true)
 |-- label: string (nullable = true)

Remove it from the columns used with VectorAssembler and the error goes away.
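
Putting this together, here is a rough Java sketch of what trainLogisticregression could look like. The StringIndexer step that turns the T/F label strings into a numeric label column is my own assumption (LogisticRegression also needs a numeric label); the answer above only covers building the features column.

import java.io.IOException;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Sp_LogistcRegression {
    public void trainLogisticregression(String path, String model_path) throws IOException {
        SparkSession spark = SparkSession.builder()
                .appName("Sp_LogistcRegression")
                .master("local[6]")
                .config("spark.driver.memory", "3G")
                .getOrCreate();

        Dataset<Row> training = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path);

        // Assemble only the numeric columns into the default "features" column;
        // the string-typed "label" column must NOT be among the inputs.
        String[] featureCols = {"abc", "pmi", "sv", "h", "rh"};
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(featureCols)
                .setOutputCol("features");
        Dataset<Row> assembled = assembler.transform(training);

        // Assumption: map the T/F strings to a numeric label with StringIndexer
        // (this part is not shown in the answer).
        StringIndexer indexer = new StringIndexer()
                .setInputCol("label")
                .setOutputCol("labelIndex");
        Dataset<Row> indexed = indexer.fit(assembled).transform(assembled);

        LogisticRegression lr = new LogisticRegression()
                .setLabelCol("labelIndex")
                .setMaxIter(1000)
                .setRegParam(0.3);
        LogisticRegressionModel lrModel = lr.fit(indexed);
        lrModel.save(model_path);

        spark.close();
    }
}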

However, if this column or any other column should be included but is of an incorrect type, it has to be cast appropriately first (provided the values the column holds allow it). Use the cast method.

cast(to: String): Column — Casts the column to a different data type, using the canonical string representation of the type. The supported types are: string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.
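
For example, in Java a column can be cast before it is handed to VectorAssembler (a minimal sketch; using the sv column here is purely for illustration):

// Cast the integer sv column to double; the original column is replaced.
Dataset<Row> casted = training.withColumn("sv", training.col("sv").cast("double"));
casted.printSchema();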

The error message should include the column name, but currently it does not, so I filed SPARK-21285 (VectorAssembler should report the column name when the data type used is not supported: https://issues.apache.org/jira/browse/SPARK-21285) to have it fixed. Please vote for it if you think it is worth having in an upcoming Spark version.
