Spark Cross Validation with Training, Testing and Validation sets

Date: 2015-10-29 15:58:38

Tags: apache-spark pyspark apache-spark-mllib apache-spark-ml

I want to run two cross-validation processes in Spark using random splits, like this:

  1. CV_global: split the data into a Training Set (90%) and a Testing Set (10%).

     1.1. CV_grid: grid search on half of the Training Set, i.e. 45% of the data.

     1.2. Fit Model: on the Training Set (90%) using the best settings from CV_grid.

     1.3. Test Model: on the Testing Set (10%).

  2. Report average metrics per 10-fold and global metrics.

The problem is I only find examples using CV and Grid search on the whole training set.

How can I get the parameters of the best performing model from CV_grid?

How can I do CV without grid search but still get stats per fold, e.g. as with sklearn.cross_validation.cross_val_score?

2 Answers:

Answer 0 (score: 0)

You have something like:
crossval.setEstimatorParamMaps(paramGrid)

Then:

cvModel = crossval.fit(trainingSetDF).bestModel 

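For individual models (at least for some of them), there are functions such as explainParams(). It is available in Spark 1.6 (it may go back as far as 1.4.2, I am not sure). Hope this helps.

Answer 1 (score: 0)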

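You have three questions. An answer to each:

1. The problem is I only find examples using CV and Grid search on the whole training set.

If you only need part of the training dataset, sample it at the desired fraction, for example: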

training = training.sample(false, .45, 78L)
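
One detail worth keeping in mind: sampling without replacement keeps each row independently with the given probability, so the subset is only approximately the requested size. Also, if `training` already holds the 90% split, a fraction of 0.5 (rather than 0.45) is what yields roughly 45% of the full data. The following is a minimal, library-agnostic Python sketch of that behaviour, not Spark code; `random_split` is a hypothetical helper written only for illustration.

```python
import random


def random_split(rows, weights, seed):
    """Hypothetical helper: send each row to one split by drawing a uniform
    number, similar in spirit to a weighted random split of a dataset."""
    total = float(sum(weights))
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(10000))  # stand-in for the full dataset

# CV_global: roughly 90% training, 10% testing
training, testing = random_split(rows, [0.9, 0.1], seed=78)

# CV_grid: roughly half of the training set, i.e. about 45% of all rows
grid_subset, _ = random_split(training, [0.5, 0.5], seed=79)

print(len(training), len(testing), len(grid_subset))
```

The printed sizes vary with the seed and are close to, but usually not exactly, 9000, 1000 and 4500.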

2. How can I get the parameters of the best performing model from CV_grid?

crossValidatedModel.bestModel.extractParamMap()

From there you get the parameter names and then their values.
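
If the winning parameters also need to be tied back to their cross-validated score, one option is to pair each parameter map in the grid with its averaged metric and pick the best pair. The sketch below shows only that selection logic in plain Python. `best_param_map` is a hypothetical helper, and the parameter maps and metric values are made-up stand-ins, not real results.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Hypothetical helper: return the (param_map, metric) pair with the best
    averaged cross-validation metric."""
    if not param_maps or len(param_maps) != len(avg_metrics):
        raise ValueError("need one averaged metric per parameter map")
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
    return param_maps[best_index], avg_metrics[best_index]


# Stand-in grid and averaged metrics (illustrative values only)
param_maps = [{"regParam": 0.01}, {"regParam": 0.1}, {"regParam": 1.0}]
avg_metrics = [0.81, 0.86, 0.79]

best_params, best_metric = best_param_map(param_maps, avg_metrics)
print(best_params, best_metric)
```

With a fitted PySpark CrossValidatorModel, the two lists would typically come from cvModel.getEstimatorParamMaps() and cvModel.avgMetrics, assuming the Spark version in use exposes both. For error-style metrics such as RMSE, pass larger_is_better=False.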

3. How to do CV without grid search but get stats per fold, e.g. sklearn.cross_validation.cross_val_score

Duplicate of: How can I access computed metrics for each fold in a CrossValidatorModel

Take a look here: Spark CrossValidatorModel access other models than the bestModel?
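
If the built-in cross validator only reports metrics averaged over folds, a manual loop over folds is one way to get per-fold numbers in the spirit of cross_val_score. The following is a minimal plain-Python sketch of such a loop, not Spark code. `cross_val_scores`, `fit_slope` and `mean_abs_error` are hypothetical names, and the toy data and model are assumptions made only to keep the example self-contained.

```python
import random


def cross_val_scores(rows, fit, score, num_folds=10, seed=0):
    """Hypothetical helper: give each row a random fold id, then for every
    fold train on the other folds and score on the held-out one."""
    rng = random.Random(seed)
    fold_of = [rng.randrange(num_folds) for _ in rows]
    scores = []
    for k in range(num_folds):
        train = [r for r, f in zip(rows, fold_of) if f != k]
        valid = [r for r, f in zip(rows, fold_of) if f == k]
        if not train or not valid:
            raise ValueError("fold %d is empty; use fewer folds or more rows" % k)
        model = fit(train)
        scores.append(score(model, valid))
    return scores


def fit_slope(train):
    # Toy "model": a single slope for y ~ slope * x
    return sum(y for _, y in train) / sum(x for x, _ in train)


def mean_abs_error(slope, valid):
    return sum(abs(slope * x - y) for x, y in valid) / len(valid)


# Toy data: y = 2x plus a deterministic +/-1 disturbance
rows = [(float(x), 2.0 * x + (1.0 if x % 2 else -1.0)) for x in range(1, 201)]

fold_scores = cross_val_scores(rows, fit_slope, mean_abs_error, num_folds=10, seed=42)
average_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, average_score)
```

In Spark, `fit` would wrap a call such as estimator.fit(trainDF) and `score` would wrap an evaluator applied to model.transform(validDF), with the folds built by filtering on a random column rather than by Python list comprehensions.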