Question

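I am using scikit-learn with a stratified CV to compare several classifiers. I am computing: accuracy, recall, AUC.

For parameter optimization I used GridSearchCV with a 5-fold CV.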

RandomForestClassifier(warm_start= True, min_samples_leaf= 1, n_estimators= 800, min_samples_split= 5,max_features= 'log2', max_depth= 400, class_weight=None)

are the best_params from GridSearchCV.

My problem: I think I am really overfitting. For example:

Random forest, with standard deviation (+/-)


precision: 0.99 (+/- 0.06)

sensitivity: 0.94 (+/- 0.06)

specificity: 0.94 (+/- 0.06)

B_accuracy: 0.94 (+/- 0.06)

AUC: 0.94 (+/- 0.11)


Logistic regression, with standard deviation (+/-)


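precision: 0.88 (+/- 0.06)

sensitivity: 0.79 (+/- 0.06)

specificity: 0.68 (+/- 0.06)

B_accuracy: 0.73 (+/- 0.06)

AUC: 0.73 (+/- 0.041)

The other classifiers give results similar to logistic regression (so they do not look overfitted).

My CV code is: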

import math
import numpy as np

# collect features and labels from the raw data rows
X, y = [], []
for row in data:
    X.append(row[0])
    y.append(float(row[1]))
x = np.array(X)
y = np.array(y)

def SD(values):
    # return the population standard deviation and the mean of values
    mean = sum(values) / float(len(values))
    squared_diffs = [(v - mean) ** 2 for v in values]
    variance = sum(squared_diffs) / float(len(values))
    return math.sqrt(variance), mean

for name, clf in zip(titles, classifiers):
    # go through all classifiers, compute 10 folds
    pre,sen,spe,ba,area=[],[],[],[],[]
    for train_index, test_index in skf:
        # select the rows of this fold directly with the NumPy index arrays
        X_train, X_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]


        #clf=clf.fit(X_train,y_train)
        #predicted=clf.predict_proba(X_test)
        #... other code, calculating metrics and so on...
    print(name)
    print("precision: %0.2f \t(+/- %0.2f)" % (SD(pre)[1], SD(pre)[0]))
    print("sensitivity: %0.2f \t(+/- %0.2f)" % (SD(sen)[1], SD(sen)[0]))
    print("specificity: %0.2f \t(+/- %0.2f)" % (SD(spe)[1], SD(spe)[0]))
    print("B_accuracy: %0.2f \t(+/- %0.2f)" % (SD(ba)[1], SD(ba)[0]))
    print("AUC: %0.2f \t(+/- %0.2f)" % (SD(area)[1], SD(area)[0]))
    print("\n")

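If I use the scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy') method, I do not get these "overfitted" values. Maybe there is something wrong with the CV method I am using? But it only happens for RF...

I wrote my own loop because cross_val_score lacks a specificity scoring function.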
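For reference, one possible way to get per-fold specificity without a hand-written loop is sketched below. It assumes a recent scikit-learn (with `sklearn.model_selection.cross_validate`), binary labels coded 0 and 1, and synthetic stand-in data; `X_demo`, `y_demo` and the forest settings are illustrative and not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (an assumption; replace with your own x and y).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Specificity is the recall of the negative class (label 0 here).
scoring = {
    "sensitivity": make_scorer(recall_score, pos_label=1),
    "specificity": make_scorer(recall_score, pos_label=0),
    "balanced_accuracy": "balanced_accuracy",
    "auc": "roc_auc",
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
results = cross_validate(clf, X_demo, y_demo, cv=cv, scoring=scoring)

# Report mean and standard deviation over the 10 test folds.
for metric in scoring:
    fold_scores = results["test_" + metric]
    print("%s: %0.2f (+/- %0.2f)" % (metric, fold_scores.mean(), fold_scores.std()))
```

Each metric here gets its own standard deviation, which makes it easy to see whether the spread really is the same across metrics.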

Answer 1

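Herbert,

If your goal is to compare different learning algorithms, I recommend using nested cross-validation. (By "learning algorithms" I mean different algorithms such as logistic regression, decision trees, and other discriminative models that learn a hypothesis or model, the final classifier, from your training data.)

"Regular" cross-validation is fine if you want to tune the hyperparameters of a single algorithm. However, as soon as you run the hyperparameter optimization with the same cross-validation parameters/folds that you use for evaluation, your performance estimate will likely be over-optimistic. If you run cross-validation over and over again, your test data becomes "training data" to some extent.

People actually ask me this question quite often, so I will quote some passages from an FAQ section I posted here: http://sebastianraschka.com/faq/docs/evaluate-a-model.html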

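In nested cross-validation, we have an outer k-fold cross-validation loop that splits the data into training and test folds, and an inner loop that selects the model via k-fold cross-validation on the training fold. After model selection, the test fold is used to evaluate the model performance. After we have identified our "favorite" algorithm, we can follow up with a "regular" k-fold cross-validation approach (on the complete training set) to find its "optimal" hyperparameters and evaluate it on the independent test set.

Let's consider a logistic regression model to make this clearer: using nested cross-validation, you will train m different logistic regression models, one for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation). If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then you proceed with the next algorithm, e.g., an SVM, and so on.

I can only highly recommend this excellent paper, which discusses the issue in more detail:

S. Varma and R. Simon. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7(1):91, 2006. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1397873/)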
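The logistic regression case described above can be sketched with an explicit outer loop. This is only an illustration under assumptions: the synthetic data (`X_demo`, `y_demo`), the grid over `C`, and the choice of 5 outer and 3 inner folds are all made up for the example, and a recent scikit-learn with `sklearn.model_selection` is assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption, not the asker's data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # m = 5

outer_scores, chosen_params = [], []
for train_idx, test_idx in outer_cv.split(X_demo, y_demo):
    # Inner loop: tune C using only the outer training fold.
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                          scoring="accuracy", cv=inner_cv)
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # Outer loop: score the refit best model on the held-out test fold.
    outer_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))
    chosen_params.append(search.best_params_)

print("selected per outer fold:", chosen_params)
print("nested CV accuracy: %.3f +/- %.3f" % (np.mean(outer_scores), np.std(outer_scores)))
```

If the values in `chosen_params` differ a lot between outer folds, that is a hint that the model selection is not stable on this data.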

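PS: Typically, you don't need or want to tune the hyperparameters of a random forest (so extensively). The idea behind random forests (a form of bagging) is actually not to prune the decision trees. In fact, one reason why Breiman came up with the random forest algorithm was to deal with the pruning issue/overfitting of individual decision trees. So the only parameter you really have to "worry" about is the number of trees (and maybe the number of random features considered at each split). Typically, though, you are best off choosing bootstrap samples of size n (where n is the number of samples in the training set) and sqrt(m) features (where m is the dimensionality of your training set).

Hope this helps!

EDIT

Some example code for doing nested CV via scikit-learn: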
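As a rough illustration of the "mostly defaults" setup described in the PS, here is a minimal sketch. The data is synthetic and the number of trees is an arbitrary choice for the example; the out-of-bag score is shown only as a cheap sanity check and does not replace nested CV.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption): n = 300 samples, m = 16 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=2)

forest = RandomForestClassifier(
    n_estimators=200,       # the main knob: number of trees
    max_features="sqrt",    # about sqrt(m) candidate features at each split
    bootstrap=True,         # each tree is fit on a bootstrap sample of the n rows
    oob_score=True,         # score each sample using trees that did not see it
    random_state=2,
)
forest.fit(X_demo, y_demo)

print("training accuracy: %.3f" % forest.score(X_demo, y_demo))
print("out-of-bag accuracy: %.3f" % forest.oob_score_)
```

Because the trees are grown unpruned, the training accuracy is expected to be very high; the out-of-bag figure is the more informative of the two numbers.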

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range, 'clf__kernel': ['linear']},
              {'clf__C': param_range, 'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

# Nested cross-validation (here: 5 x 2 cross-validation)
# =====================================
# inner loop: 5-fold grid search; outer loop: 2-fold performance estimate
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                  scoring='accuracy', cv=5)
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=2)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Random forest is overfitting
