Question

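I am trying to use PySpark's RandomForestClassifier to determine feature importances, and I was confused to find that the resulting array is filled entirely with zeros. Can anyone explain why this happens?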

The code I use to build the training pipeline is as follows:

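from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# feature_cols (list of column names) and df (training DataFrame) are defined elsewhere
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
rfc = RandomForestClassifier(labelCol='label', featuresCol='features')
paramGrid = (ParamGridBuilder()
             .addGrid(rfc.maxDepth, [3, 10, 20])
             .addGrid(rfc.minInfoGain, [0.01, 0.001])
             .addGrid(rfc.numTrees, [5, 10, 15])
             .build())
evaluator = BinaryClassificationEvaluator()
pipeline = Pipeline(stages=[assembler, rfc])
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
rfc_model = crossval.fit(df)  # train model
best_model = rfc_model.bestModel  # PipelineModel; the last stage is the fitted forest
print(best_model.stages[-1].featureImportances.toArray())  # [0. 0. 0. 0. 0. 0. 0. 0.]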

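I should add that the same problem persists even after saving and reloading the model. The model itself makes predictions without any trouble; part of the prediction output looks like this: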

features=DenseVector([12000.0, 319.0, 3.0, 8.0, -6.8023, 6.9123, 5.0, 18.0]), rawPrediction=DenseVector([4.9981, 0.0019]), probability=DenseVector([0.9996, 0.0004]), prediction=0.0

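I also tried extracting the individual entries of this array and inspecting them. Many thanks to anyone who can help me with this puzzling situation!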
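One check that might help narrow this down, offered as a sketch rather than a confirmed diagnosis: feature importances in a tree ensemble are accumulated from the splits in each tree, so a forest whose trees never split (each tree is a single leaf) would report all zeros. The helpers below are hypothetical names of my own; they only read the `trees`, `numNodes` and `depth` attributes that a fitted `RandomForestClassificationModel` exposes, and whether that is actually the cause here is an assumption to be verified.

```python
def summarize_trees(rf_model):
    """Return one (tree_index, numNodes, depth) tuple per tree in a fitted forest.

    rf_model is assumed to be a fitted RandomForestClassificationModel,
    e.g. best_model.stages[-1] from the pipeline above.
    """
    return [(i, tree.numNodes, tree.depth) for i, tree in enumerate(rf_model.trees)]


def all_trees_are_single_leaf(rf_model):
    """True if no tree contains a split, i.e. every tree has exactly one node."""
    return all(tree.numNodes == 1 for tree in rf_model.trees)
```

With the pipeline above, this would be called as `summarize_trees(best_model.stages[-1])`. If every tree reports one node and depth 0, the zero importances would follow from the trees having no splits, and it may be worth looking at the label distribution and the `minInfoGain` values in the grid. `best_model.stages[-1].toDebugString` prints the full tree structure for a closer look.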

Spark-ML RandomForestClassifier feature importances array shows all zeros

0 answers: