RandomForest: how to choose the best n_estimators parameter

Date: 2018-09-26 08:41:14

Tags: python machine-learning scikit-learn random-forest cross-validation

I want to train my model and select the best number of trees. Here is the code:

from sklearn.ensemble import RandomForestClassifier

tree_dep = [3,5,6]
tree_n = [2,5,7]

avg_rf_f1 = []
search = []

for x in tree_dep:
  for y in tree_n:
    search.append((x,y))
    rf_model = RandomForestClassifier(n_estimators=tree_n, max_depth=tree_dep, random_state=42)
    rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')

    avg_rf_f1.append(np.mean(rf_scores))

best_tree_dep, best_n = search[np.argmax(avg_rf_f1)]

The error is on this line:

rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')

ValueError: n_estimators must be an integer, got <class 'list'>.

I'd like to know how to fix it. Thanks!!!

2 answers:

Answer 0 (score: 3)

scikit-learn has a helper class called GridSearchCV. It takes lists of parameter values to test, trains a classifier with every possible combination, and returns the best parameter set.
I'd suggest it is cleaner and less error-prone than the nested-loop approach you are implementing. It also extends easily to other parameters (just add the ones you need to the grid) and can be parallelized.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

params_to_test = {
    'n_estimators':[2,5,7],
    'max_depth':[3,5,6]
}

#here you can set any fixed parameter you want for every run, like random_state or verbosity
rf_model = RandomForestClassifier(random_state=42)
#here you specify the CV parameters: number of folds, number of cores to use...
grid_search = GridSearchCV(rf_model, param_grid=params_to_test, cv=10, scoring='f1_macro', n_jobs=4)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_ 

#best_params is a dict you can pass directly to train a model with optimal settings 
best_model = RandomForestClassifier(**best_params)

As pointed out in the comments, the best model is already stored in the grid_search object, so instead of creating a new model with:

best_model = RandomForestClassifier(**best_params)

we can simply use the one from grid_search:

best_model = grid_search.best_estimator_
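For completeness, the whole flow can be run end to end. The snippet below is a minimal sketch on synthetic data: make_classification and the train/test split are stand-ins for the asker's X_train/y_train, which are not shown in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the asker's X_train / y_train
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params_to_test = {
    'n_estimators': [2, 5, 7],
    'max_depth': [3, 5, 6]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid=params_to_test, cv=10, scoring='f1_macro')
grid_search.fit(X_train, y_train)

# GridSearchCV refits the best combination on all of X_train by default,
# so best_estimator_ is ready to use without retraining
best_model = grid_search.best_estimator_
test_acc = best_model.score(X_test, y_test)  # mean accuracy on held-out data
print(grid_search.best_params_)
```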

Answer 1 (score: 1)

You are looping over the elements of the lists, but you are not using them inside the loop: instead of passing individual elements as n_estimators and max_depth, you are passing the whole lists. This should fix it; now each iteration trains on a different combination of elements from the two lists:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

tree_dep = [3,5,6]
tree_n = [2,5,7]

avg_rf_f1 = []
search = []

for x in tree_dep:
  for y in tree_n:
    search.append((x,y))
    rf_model = RandomForestClassifier(n_estimators=y, max_depth=x, random_state=42)
    rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10, scoring='f1_macro')

    avg_rf_f1.append(np.mean(rf_scores))

best_tree_dep, best_n = search[np.argmax(avg_rf_f1)]
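The same search can be written a little more tidily with itertools.product, which replaces the nested loops with a single loop over all (depth, n_estimators) pairs. The sketch below is runnable on synthetic data; the make_classification call is invented here as a stand-in for the asker's training set.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the asker's X_train / y_train
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=42)

tree_dep = [3, 5, 6]  # candidate max_depth values
tree_n = [2, 5, 7]    # candidate n_estimators values

avg_rf_f1 = []
search = []

# product() yields every (depth, n_estimators) pair in turn
for depth, n_est in product(tree_dep, tree_n):
    search.append((depth, n_est))
    rf_model = RandomForestClassifier(n_estimators=n_est, max_depth=depth,
                                      random_state=42)
    rf_scores = cross_val_score(rf_model, X_train, y_train, cv=10,
                                scoring='f1_macro')
    avg_rf_f1.append(np.mean(rf_scores))

best_tree_dep, best_n = search[int(np.argmax(avg_rf_f1))]
print(best_tree_dep, best_n)
```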