Getting probabilities from cross_val_score

Time: 2017-11-12 23:03:04

Tags: python scikit-learn

I have the following machine learning pipeline in Python, using nested cross-validation:

from numpy import logspace
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)
pipe_svm = Pipeline([('scl', StandardScaler()), ('clf', SVC(kernel="linear"))])
parameters = {'clf__C': logspace(-4, 1, 50)}
grid_search = GridSearchCV(estimator=pipe_svm, param_grid=parameters, verbose=1, scoring='roc_auc', cv=sss_inner)
cross_val_score(grid_search, X, y, cv=sss_outer)

Now I would like to get the probabilities out of cross_val_score so that I can compute the AUC and plot ROC and precision/recall curves. How can I do this?

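2 answers:

Answer 0 (score: 2)

You can use the sklearn.metrics.roc_curve function to compute the ROC curve of your model, and sklearn.metrics.auc to get the area under it.

Here is a sample snippet using an SVM classifier: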

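from numpy import logspace
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features, already split into a training part and a test part
X = {
    "train": [...],
    "test": [...]
}

# Matching class labels
y = {
    "train": [...],
    "test": [...]
}

sss_outer = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=15)
sss_inner = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=16)

# probability=True is needed for SVC to expose predict_proba
pipe_svm = Pipeline([('scl', StandardScaler()), ('clf', SVC(kernel="linear", probability=True))])
parameters = {'clf__C': logspace(-4, 1, 50)}
grid_search = GridSearchCV(estimator=pipe_svm, param_grid=parameters, verbose=1, scoring='roc_auc', cv=sss_inner)

# Fit on the training part, then predict class probabilities for the test part
probas_ = grid_search.fit(X["train"], y["train"]).predict_proba(X["test"])
fpr, tpr, thresholds = roc_curve(y["test"], probas_[:, 1])
roc_auc = auc(fpr, tpr)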

You can also have a look at the scikit-learn example "Receiver Operating Characteristic (ROC) with cross validation" for more details.
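A rough sketch of the per-fold idea behind that example is shown here. It is not the code from the scikit-learn gallery: the synthetic data from make_classification, the StratifiedKFold splitter and the name fold_aucs are all assumptions made for illustration. It also uses decision_function rather than predict_proba, since roc_curve accepts non-thresholded decision values as well as probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data standing in for the real X and y (an assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([("scl", StandardScaler()), ("clf", SVC(kernel="linear"))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    # Fit on the training part of the fold, score the held-out part.
    pipe.fit(X[train_idx], y[train_idx])
    scores = pipe.decision_function(X[test_idx])
    fpr, tpr, _ = roc_curve(y[test_idx], scores)
    fold_aucs.append(auc(fpr, tpr))

print("Per-fold AUC:", np.round(fold_aucs, 3))
print("Mean AUC: %0.3f" % np.mean(fold_aucs))
```

Each (fpr, tpr) pair could be passed to a plotting call inside the loop to draw one ROC curve per fold.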

Hope it helps.

Answer 1 (score: 0)

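from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_predict, cross_val_score)

## `log` (the classifier) and `log_hyper` (its parameter grid) are assumed to come from the earlier steps

## 3. set up cross validation method
inner_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5)
outer_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5)

## 4. set up inner cross validation parameter tuning, can use this to get AUC
log_model = GridSearchCV(estimator=log, param_grid=log_hyper, cv=inner_cv, scoring='roc_auc')

## 5. ordinary nested cross validation without probabilities
log_scores = cross_val_score(log_model, X, Y, scoring='roc_auc', cv=outer_cv)
print("AUC: %0.2f (+/- %0.2f)" % (log_scores.mean(), log_scores.std() * 2))

## 6. this is to get the probabilities from nested cross validation
## cross_val_predict needs a splitter that tests each sample exactly once,
## so a non-repeated StratifiedKFold is used here instead of outer_cv
predict_cv = StratifiedKFold(n_splits=10, shuffle=True)
log_scores2 = cross_val_predict(log_model, X, Y, cv=predict_cv, method='predict_proba')
fpr, tpr, thresholds = roc_curve(Y, log_scores2[:, 1])
roc_auc = auc(fpr, tpr)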

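You can use the cross_val_predict function to run the nested cross-validation and get the probabilities needed for the ROC curve. As far as I know, GridSearchCV does not let you extract cross-validated probabilities the way cross_val_predict does.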
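A self-contained sketch of this approach follows, which also covers the precision/recall curve the question asks about. The synthetic data, the LogisticRegression classifier, the C grid and the variable names are assumptions chosen only so the snippet runs on its own; they are not taken from the answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

# Synthetic binary data standing in for the real X and y (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=16)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)

# Inner loop: hyperparameter tuning, scored by ROC AUC.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="roc_auc", cv=inner_cv)

# Outer loop: one row of class probabilities per sample, each produced by
# the model fitted on the outer folds that did not contain that sample.
probas = cross_val_predict(search, X, y, cv=outer_cv, method="predict_proba")

fpr, tpr, _ = roc_curve(y, probas[:, 1])
precision, recall, _ = precision_recall_curve(y, probas[:, 1])
print("Pooled ROC AUC: %0.3f" % auc(fpr, tpr))
print("Area under PR curve: %0.3f" % auc(recall, precision))
```

Note that this pools the held-out predictions of all outer folds into a single curve, which is not the same quantity as the mean of the per-fold AUC values reported by cross_val_score.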

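As I understand it, cross_val_predict does not average probabilities over iterations: each sample gets the prediction from the one outer fold in which it was held out, which is why the outer splitter has to test every sample exactly once. You cannot get cross-validated probabilities from cross_val_score, which only returns one score per split.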
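One way to check how the two functions behave is to compare the shapes of what they return. The sketch below assumes synthetic data and a plain LogisticRegression, neither of which comes from the answer, and the flag repeated_ok is a hypothetical name used only to record the outcome.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RepeatedStratifiedKFold, StratifiedKFold,
                                     cross_val_predict, cross_val_score)

# Synthetic binary data standing in for the real X and y (an assumption).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000)
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

# cross_val_score: one score per split, no per-sample output.
scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=repeated_cv)
print("scores shape:", scores.shape)

# cross_val_predict with a partitioning splitter: one row per sample.
probas = cross_val_predict(clf, X, y, cv=StratifiedKFold(n_splits=5),
                           method="predict_proba")
print("probas shape:", probas.shape)

# A repeated splitter puts each sample in a test set more than once,
# so cross_val_predict refuses it instead of averaging.
try:
    cross_val_predict(clf, X, y, cv=repeated_cv, method="predict_proba")
    repeated_ok = True
except ValueError as err:
    repeated_ok = False
    print("ValueError:", err)
```

If averaged probabilities over repeats are really wanted, one option is to call cross_val_predict once per repeat with differently seeded StratifiedKFold splitters and average the resulting arrays by hand.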

If you are not sure whether a method is appropriate, try running a negative control on random data. It is a good habit.
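A minimal sketch of such a negative control is given below, assuming pure-noise features and labels drawn independently of them. The names X_rand, y_rand and control_auc are made up for illustration. If the evaluation procedure does not leak information, the cross-validated AUC should land close to 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.RandomState(0)
X_rand = rng.normal(size=(400, 10))    # pure-noise features
y_rand = rng.randint(0, 2, size=400)   # labels unrelated to X_rand

# Same kind of cross-validated probability extraction as for real data.
probas = cross_val_predict(LogisticRegression(max_iter=1000), X_rand, y_rand,
                           cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                           method="predict_proba")

control_auc = roc_auc_score(y_rand, probas[:, 1])
print("Negative-control AUC: %0.3f" % control_auc)
```

An AUC well above 0.5 on data like this would suggest that something in the pipeline, for example preprocessing fitted on the full data set, is leaking test information.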