Question

I am trying to run a DecisionTreeClassifier through GridSearchCV, with max_depth as the only hyperparameter. The two versions I ran are:

max_depth = range(1,20)

The best_estimator_ attribute shows max_depth=15, and the scoring function gives 0.8880 on the test set.

max_depth = range(1,15)

The best_estimator_ attribute shows max_depth=10, with a score of 0.8907.

My question is: why did GridSearchCV not choose max_depth=10 in the first run, if it gives a higher score?

The code is as follows:

from sklearn import tree
from sklearn.metrics import accuracy_score, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# X_train, y_train, X_test, y_test are assumed to be defined earlier.
clf = tree.DecisionTreeClassifier(random_state=7)

parameters = {"max_depth": range(1, 20), "random_state": [7]}

scorer = make_scorer(fbeta_score, beta=0.5)

grid_obj = GridSearchCV(estimator=clf, param_grid=parameters, scoring=scorer)

grid_fit = grid_obj.fit(X_train, y_train)

best_clf = grid_fit.best_estimator_

predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print(best_clf)

print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(
    accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(
    fbeta_score(y_test, best_predictions, beta=0.5)))

Answer 1

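Your question

Why didn't GridSearchCV choose a max_depth of 10 the first time, if it scores higher?

My answer (to the best of my knowledge; I have gone through too many sources in the past to cite them)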

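The deeper the tree, the more complex the patterns it learns from the training data. This is known as "overfitting": the model fits the training data very well but may not generalize well to unseen data. Why do some sklearn estimators default to max_depth=3 (for example GradientBoostingClassifier; DecisionTreeClassifier itself defaults to max_depth=None)? That is a design decision by the sklearn team.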

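But why max_depth=3?

The developers probably settled on it by looking for a default that works for most use cases. They may also have found that a depth of 3 generalizes better to unseen data.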
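To make the overfitting point concrete, here is a minimal sketch that prints the mean training score and the mean cross-validated score for each max_depth. It also shows that GridSearchCV ranks candidates by the mean cross-validated score computed on the training data, so a score measured afterwards on a separate test set plays no part in the selection. The synthetic dataset from make_classification, the default accuracy scoring, and cv=5 are assumptions for illustration, not the asker's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the asker's X_train / y_train (an assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=7),
    param_grid={"max_depth": list(range(1, 20))},
    cv=5,
    return_train_score=True,  # also record scores on the training folds
)
grid.fit(X, y)

results = grid.cv_results_
for params, train, val in zip(results["params"],
                              results["mean_train_score"],
                              results["mean_test_score"]):
    print("max_depth={:2d}  train={:.4f}  cv={:.4f}".format(
        params["max_depth"], train, val))

# The selected depth is the one with the highest mean cross-validated score.
print("chosen:", grid.best_params_,
      "mean CV score: {:.4f}".format(grid.best_score_))
```

Typically the training score keeps rising with depth while the cross-validated score levels off or drops. Because the ranking uses only the cross-validated score, the chosen depth does not have to be the one that happens to score best on X_test.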

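Decision trees are random

Unless the randomness is fixed, you will not necessarily get the same best_estimator_ on every rerun. Try setting random_state so that the result is reproducible each time.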
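As a rough check of the reproducibility point, the sketch below fits the same tree twice with a fixed random_state and compares the predictions. The synthetic dataset and the helper name fit_and_predict are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def fit_and_predict(seed):
    """Fit a depth-limited tree with the given seed and predict on X."""
    clf = DecisionTreeClassifier(max_depth=5, random_state=seed)
    return clf.fit(X, y).predict(X)


first_run = fit_and_predict(7)
second_run = fit_and_predict(7)

# With the same seed and the same data, both fits produce the same predictions.
print("identical with fixed seed:", np.array_equal(first_run, second_run))
```

The same idea carries over to the grid search: passing random_state to the estimator, as the code in the question already does, keeps each candidate tree repeatable across runs.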
