Question

I want to do two cross-validation steps in Spark using randomSplit, like this:

1.1. CV_grid: grid search on half of Training Set, i.e. 45% of data.

1.2. Fit Model: on Training set (90%) using the best settings from CV_grid.

1.3. Test Model: on Testing set (10%).

The problem is I only find examples using CV and Grid search on the whole training set.

How can I get the parameters of the best performing model from CV_grid?

How can I do CV without grid search but still get stats per fold, as with sklearn.cross_validation.cross_val_score?

Answer 1

You have something like this:

crossval.setEstimatorParamMaps(paramGrid)

and then

cvModel = crossval.fit(trainingSetDF).bestModel

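For a single model (at least for some of them), there are functions such as explainParams(). It is available in Spark 1.6 (it may go back as far as 1.4.2, I'm not sure). Hope this helps.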

Answer 2

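You have three questions. An answer to each:

1. The problem is I only find examples using CV and Grid search on the whole training set.

If you only need part of the training dataset, sample it at the desired fraction (the fraction is relative to the DataFrame you call sample on), e.g.: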

training = training.sample(false, .45, 78L)
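The split described in the question can be sketched in PySpark terms as below. This is a minimal sketch, not a definitive recipe: the function name split_for_grid_search and the seed are made up for illustration, and df is assumed to be a Spark DataFrame (only its randomSplit and sample methods are used). Both methods sample randomly, so the resulting sizes are approximate.

```python
def split_for_grid_search(df, seed=78):
    """Split df into ~90% training and ~10% testing, plus ~half of training for grid search.

    Hypothetical helper: df is assumed to be a Spark DataFrame.
    """
    # Step 1: hold out roughly 10% of the data for the final test.
    training, testing = df.randomSplit([0.9, 0.1], seed=seed)
    # Step 2: take roughly half of the training set (about 45% of all data)
    # for the grid search. The fraction is relative to `training`, not to `df`.
    grid_subset = training.sample(withReplacement=False, fraction=0.5, seed=seed)
    return training, testing, grid_subset
```

The grid search would then run on grid_subset, the final fit on training, and the evaluation on testing.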

2. How can I get the parameters of the best performing model from CV_grid?

crossValidatedModel.bestModel().extractParamMap()

From there you can read the parameter names and then their values.
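In PySpark, one way to read names and values is sketched below. Note that it swaps in a different route from the one above: instead of inspecting the best model, it pairs each candidate param map with its average metric and picks the winner. It assumes a PySpark version in which the fitted CrossValidatorModel exposes avgMetrics (in the same order as the param maps passed to the CrossValidator); the helper names param_map_to_dict and best_param_map are hypothetical.

```python
def param_map_to_dict(param_map):
    """Turn a {Param: value} map into a plain {param_name: value} dict."""
    return {param.name: value for param, value in param_map.items()}


def best_param_map(cv_model, param_maps, larger_is_better=True):
    """Return (average metric, {name: value}) for the best grid point.

    Assumptions: cv_model is a fitted CrossValidatorModel with an avgMetrics
    list, and param_maps is the list given to setEstimatorParamMaps.
    """
    scored = list(zip(cv_model.avgMetrics, param_maps))
    pick = max if larger_is_better else min
    metric, param_map = pick(scored, key=lambda pair: pair[0])
    return metric, param_map_to_dict(param_map)
```

A call might look like best_param_map(crossval.fit(trainingSetDF), crossval.getEstimatorParamMaps(), evaluator.isLargerBetter()), where evaluator is whatever evaluator the CrossValidator was configured with.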

3. How to do CV without grid search but get stats per fold, e.g. sklearn.cross_validation.cross_val_score?

Duplicate of
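For reference, per-fold scores can also be computed by hand. The following is a rough sketch in the spirit of cross_val_score, not a built-in Spark API: the function name cross_val_scores is made up, df is assumed to be a Spark DataFrame, and estimator and evaluator are assumed to be a Spark ML Estimator and Evaluator. It uses DataFrame.union, which assumes Spark 2.0 or later (earlier versions use unionAll), and it needs k of at least 2.

```python
from functools import reduce


def cross_val_scores(estimator, evaluator, df, k=3, seed=42):
    """Fit `estimator` k times and return one evaluator metric per fold."""
    # randomSplit normalizes the weights, so k equal weights give
    # k folds of roughly equal size.
    folds = df.randomSplit([1.0] * k, seed=seed)
    scores = []
    for i, validation in enumerate(folds):
        # Train on every fold except the i-th, which is held out.
        train = reduce(lambda a, b: a.union(b), folds[:i] + folds[i + 1:])
        model = estimator.fit(train)
        scores.append(evaluator.evaluate(model.transform(validation)))
    return scores
```

The returned list holds one metric per fold, so the mean and spread can be computed directly from it.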