Why can't I get the same results as GridSearchCV?

Asked: 2019-04-16 23:19:38

Tags: python machine-learning scikit-learn

GridSearchCV only returns a score for each parametrization, and I would also like to see a ROC curve to better understand the results. To do this, I would like to take the best-performing model from GridSearchCV and reproduce the same results, but cache the probabilities. Here is my code:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from tqdm import tqdm

import warnings
warnings.simplefilter("ignore")

data = make_classification(n_samples=100, n_features=20, n_classes=2, 
                           random_state=1, class_sep=0.1)
X, y = data


small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100))), 
    ('clf', LogisticRegression())
])

params = {
    'clf__class_weight': ['balanced'],
    'clf__penalty'     : ['l1', 'l2'],
    'clf__C'           : [0.1, 0.5, 1.0],
    'rfs__max_features': [3, 5, 10]
}
key_feats = ['mean_train_score', 'mean_test_score', 'param_clf__C', 
             'param_clf__penalty', 'param_rfs__max_features']

skf = StratifiedKFold(n_splits=5, random_state=0)

all_results = list()
for _ in tqdm(range(25)):
    gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc', cv=skf, n_jobs=-1)
    gs.fit(X, y)
    results = pd.DataFrame(gs.cv_results_)[key_feats]
    all_results.append(results)


param_group = ['param_clf__C', 'param_clf__penalty', 'param_rfs__max_features']
all_results_df = pd.concat(all_results)
all_results_df.groupby(param_group).agg(['mean', 'std']
                    ).sort_values(('mean_test_score', 'mean'), ascending=False).head(20)

Here is my attempt at reproducing the results

small_pipe_w_params = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=3)), 
    ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=0.1))
])
skf = StratifiedKFold(n_splits=5, random_state=0)
all_scores = list()
for _ in range(25):
    scores = list()
    for train, test in skf.split(X, y):
        small_pipe_w_params.fit(X[train, :], y[train])
        probas = small_pipe_w_params.predict_proba(X[test, :])[:, 1]
        # cache probas here to build an Roc w/ conf interval later
        scores.append(roc_auc_score(y[test], probas))
    all_scores.extend(scores)

print('mean: {:<1.3f}, std: {:<1.3f}'.format(np.mean(all_scores), np.std(all_scores)))
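For context, this is roughly what I intend to do with the cached probabilities once the numbers line up. It is only a sketch: cached_probas, the common FPR grid and the one-standard-deviation band are illustrative choices, not code I have run.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# `cached_probas` is assumed to be a list of (y_true, probas) pairs, one per
# CV fold, collected at the "cache probas here" comment above.
mean_fpr = np.linspace(0, 1, 101)
tprs = []
for y_true, probas in cached_probas:
    fpr, tpr, _ = roc_curve(y_true, probas)
    tprs.append(np.interp(mean_fpr, fpr, tpr))  # put every fold on a common FPR grid

mean_tpr = np.mean(tprs, axis=0)
std_tpr = np.std(tprs, axis=0)

plt.plot(mean_fpr, mean_tpr, label='mean ROC')
plt.fill_between(mean_fpr, np.clip(mean_tpr - std_tpr, 0, 1),
                 np.clip(mean_tpr + std_tpr, 0, 1), alpha=0.3, label='+/- 1 std')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()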

I'm running the cross-validation loop above multiple times because the results seem unstable. I created a deliberately challenging dataset, since my own dataset is similarly hard to learn. The groupby is meant to aggregate all iterations of GridSearchCV, taking the mean and std of the train and test scores to stabilize the results. I then pick out the best-performing model (C=0.1, penalty=l2 and max_features=3 in my most recent run) and try to reproduce the same results by setting those parameters explicitly.

GridSearchCV yields a mean ROC AUC of 0.63 with std 0.042, whereas my own implementation gets a mean of 0.59 with std 0.131. The grid search scores are considerably better. If I run this experiment for 100 iterations of both GridSearchCV and my own loop, the results are similar.

Why aren't these results the same? They both use StratifiedKFold() internally when an integer is supplied for cv... and maybe GridSearchCV weights the scores by the size of each fold? I'm not sure about that, though it would make sense. Is my implementation flawed?
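To sanity-check the weighting guess, I suppose I could compare the reported mean_test_score against both a plain and a size-weighted average of the per-split scores in cv_results_. A sketch only, assuming gs is one of the fitted searches from the loop above:

import numpy as np

# Compare GridSearchCV's reported mean with manual averages of the per-split
# scores for the best candidate (assumes `gs` is a fitted search from above).
res = gs.cv_results_
best = gs.best_index_
split_scores = np.array([res['split%d_test_score' % i][best] for i in range(5)])
fold_sizes = [len(test) for _, test in skf.split(X, y)]

print('reported mean_test_score:', res['mean_test_score'][best])
print('simple mean of splits   :', split_scores.mean())
print('size-weighted mean      :', np.average(split_scores, weights=fold_sizes))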

Edit: random_state added to StratifiedKFold

1 Answer:

Answer 0 (score: 0)

If you set the random_state of the RandomForestClassifier, the variation between the different GridSearchCV runs will disappear.
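Concretely, the only change is seeding the forest inside the pipeline before it goes into the search. A sketch, reusing X, y, params and skf from the question:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Same pipeline as in the question, but with the forest seeded so that the
# selected features no longer change between refits.
small_pipe = Pipeline([
    ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10,
                                                   random_state=0))),
    ('clf', LogisticRegression(random_state=0))
])

gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc',
                  cv=skf, n_jobs=-1)
gs.fit(X, y)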

For simplicity, I set n_estimators to 10 and got the following results:

                                                         mean_train_score       mean_test_score
                                                             mean       std         mean    std
param_clf__C  param_clf__penalty  param_rfs__max_features
1.0           l2                  5                      0.766701  0.000000     0.580727    0.0
                                  10                     0.768849  0.000000     0.577737    0.0

Now, if we look at the per-split performance of the best hyperparameters (by dropping the key_feats filter) using

all_results_df.sort_values(('mean_test_score'), ascending=False).head(1).T

we get

    16
mean_fit_time   0.228381
mean_score_time 0.113187
mean_test_score 0.580727
mean_train_score    0.766701
param_clf__C    1
param_clf__class_weight balanced
param_clf__penalty  l2
param_rfs__max_features 5
params  {'clf__class_weight': 'balanced', 'clf__penalt...
rank_test_score 1
split0_test_score   0.427273
split0_train_score  0.807051
split1_test_score   0.47
split1_train_score  0.791745
split2_test_score   0.54
split2_train_score  0.789243
split3_test_score   0.78
split3_train_score  0.769856
split4_test_score   0.7
split4_train_score  0.67561
std_fit_time    0.00586908
std_score_time  0.00152781
std_test_score  0.13555
std_train_score 0.0470554

Let's reproduce that!

skf = StratifiedKFold(n_splits=5, random_state=0)

scores = []
weights = []


for train, test in skf.split(X, y):
    small_pipe_w_params = Pipeline([
                ('rfs', SelectFromModel(RandomForestClassifier(n_estimators=10, 
                                                               random_state=0),max_features=5)), 
                ('clf', LogisticRegression(class_weight='balanced', penalty='l2', C=1.0,random_state=0))
            ])
    small_pipe_w_params.fit(X[train, :], y[train])
    probas = small_pipe_w_params.predict_proba(X[test, :])
    # cache probas here to build an Roc w/ conf interval later
    scores.append(roc_auc_score(y[test], probas[:,1]))
    weights.append(len(test))  # fold size, used to weight the mean like GridSearchCV

print(scores)
print('mean: {:<1.6f}, std: {:<1.3f}'.format(np.average(scores, axis=0, weights=weights), np.std(scores)))
  

[0.42727272727272736, 0.47, 0.54, 0.78, 0.7]
mean: 0.580727, std: 0.135

Note: mean_test_score is not a simple average but a weighted average. The reason is the iid parameter.

From the documentation:

iid : boolean, default='warn'
If True, return the average score across folds, weighted by the number of samples in each test set. In this case, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds. If False, return the average score across folds. Default is True, but will change to False in version 0.21, to correspond to the standard definition of cross-validation.

Changed in version 0.20: Parameter iid will change from True to False by default in version 0.22, and will be removed in version 0.24.
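If you would rather have GridSearchCV report the plain, unweighted fold average (which is what your manual loop computes), you can pass iid=False explicitly in the 0.20 release used here. A small sketch, reusing small_pipe, params and skf from above:

# With iid=False, mean_test_score becomes the unweighted mean of the per-fold
# scores, i.e. what np.mean(scores) gives in the manual loop above.
gs = GridSearchCV(small_pipe, param_grid=params, scoring='roc_auc',
                  cv=skf, n_jobs=-1, iid=False)
gs.fit(X, y)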