更新

Question

下面你们这些人是决策树的片段，因为它非常庞大。

当节点中的最低值小于5时，如何使树停止增长。以下是生成决策树的代码。在{{3}}我们可以看到唯一的方法是 min_impurity_decrease ，但我不确定它是如何具体运作的。

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


X, y = make_classification(n_samples=1000,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Creating a dataFrame
df = pd.DataFrame({'Feature 1':X[:,0],
                                  'Feature 2':X[:,1],
                                  'Feature 3':X[:,2],
                                  'Feature 4':X[:,3],
                                  'Feature 5':X[:,4],
                                  'Feature 6':X[:,5],
                                  'Class':y})


y_train = df['Class']
X_train = df.drop('Class',axis = 1)

dt = DecisionTreeClassifier( random_state=42)                
dt.fit(X_train, y_train)

from IPython.display import display, Image
import pydotplus
from sklearn import tree
from sklearn.tree import _tree
from sklearn import tree
import collections
import drawtree
import os  

os.environ["PATH"] += os.pathsep + 'C:\\Anaconda3\\Library\\bin\\graphviz'

dot_data = tree.export_graphviz(dt, out_file = 'thisIsTheImagetree.dot',
                                 feature_names=X_train.columns, filled   = True
                                    , rounded  = True
                                    , special_characters = True)

graph = pydotplus.graph_from_dot_file('thisIsTheImagetree.dot')  

thisIsTheImage = Image(graph.create_png())
display(thisIsTheImage)
#print(dt.tree_.feature)

from subprocess import check_call
check_call(['dot','-Tpng','thisIsTheImagetree.dot','-o','thisIsTheImagetree.png'])

更新

我认为min_impurity_decrease可以帮助实现这一目标。调整min_impurity_decrease确实会修剪树。任何人都可以解释min_impurity_decrease。

我试图理解scikit学习中的等式，但我不确定right_impurity和left_impurity的价值是什么。

N = 256
N_t = 256
impurity = ??
N_t_R = 242
N_t_L = 14
right_impurity = ??
left_impurity = ??

New_Value = N_t / N * (impurity - ((N_t_R / N_t) * right_impurity)
                    - ((N_t_L / N_t) * left_impurity))
New_Value

更新2

我们在一定条件下修剪，而不是修剪某个值。如我们分为6/4和5/5但不是6000/4或5000/5。让我们说一个值与节点中的相邻值相比是否低于某个百分比，而不是某个值。

      11/9
   /       \
  6/4       5/5
 /   \     /   \
6/0  0/4  2/2  3/3

Answer 1

使用min_impurity_decrease或任何其他内置停止条件无法直接限制叶子的最低值（特定类的出现次数）。

我认为在不更改scikit-learn源代码的情况下实现此目的的唯一方法是 post-prune 您的树。要实现这一点，您可以遍历树并删除节点的所有子节点，最小类数少于5（或您可以想到的任何其他条件）。我会继续你的例子：

from sklearn.tree._tree import TREE_LEAF

def prune_index(inner_tree, index, threshold):
    if inner_tree.value[index].min() < threshold:
        # turn node into a leaf by "unlinking" its children
        inner_tree.children_left[index] = TREE_LEAF
        inner_tree.children_right[index] = TREE_LEAF
    # if there are shildren, visit them as well
    if inner_tree.children_left[index] != TREE_LEAF:
        prune_index(inner_tree, inner_tree.children_left[index], threshold)
        prune_index(inner_tree, inner_tree.children_right[index], threshold)

print(sum(dt.tree_.children_left < 0))
# start pruning from the root
prune_index(dt.tree_, 0, 5)
sum(dt.tree_.children_left < 0)

此代码将首先打印74，然后打印91。这意味着代码创建了17个新的叶节点（通过实际删除其祖先的链接）。树，以前看起来像

现在看起来像

所以你可以看到确实已经减少了很多。

Answer 2

编辑：这是不正确的，因为@SBylemans和@Viktor在评论中指出了这一点。我没有删除答案，因为其他人也可能认为这是解决方案。

将min_samples_leaf设为5。

min_samples_leaf：

叶子节点所需的最小样本数量：

更新：我认为无法使用min_impurity_decrease完成此操作。考虑以下情况：

      11/9
   /         \
  6/4       5/5
 /   \     /   \
6/0  0/4  2/2  3/3

根据您的规则，您不希望拆分节点6/4，因为4小于5但您想要拆分5/5节点。但是，拆分6/4节点的信息增益为0.48，分割5/5的信息增益为0。

Answer 3

有趣的是，min_impurity_decrease看起来并不允许您提供的摘录中显示的任何节点的增长（分裂后的杂质之和等于分裂前的杂质，因此存在没有杂质减少）。但是，虽然它不能完全为您提供所需的结果（如果最小值低于5，则终止节点），但可能会给您类似的结果。

如果我的测试正确，那么官方文档会使它看起来比实际的更加复杂。只需从潜在的父节点中取较低的值，然后减去建议的新节点的较低值的总和-这就是总杂质减少量。然后将其除以整棵树中的样本总数-如果节点被拆分，这将为您实现减少杂质杂质。

如果您有1000个样本，并且一个节点的值为5（即5个“杂质”），则5/1000代表如果该节点被完全拆分，则可以实现的最大杂质减少量。因此，将min_impurity_decrease设置为0.005会近似以小于5的杂质停止叶子。实际上，它会使大多数叶子中的杂质多于5个（取决于所提议的拆分产生的杂质），所以它只是一个近似值，但最好的是，我可以告诉它它最接近而无需后期修剪。 / p>

Answer 4

我对“ David Dale”的答复被删除，我必须这样说 “ David Dale”提出的方法并不完美，因为修剪后sklearn模型的以下参数不会同步更改：

`

model.tree_.impurity,
model.tree_.value,
model.tree_.children_left,
model.tree_.children_right

` 以及要在sklearn CART模型上执行CCP（成本复杂度修剪）算法时，

您无法通过上述方法实现该算法。

补充：我已经成功地在Sklearn的模型上实现了成本复杂度修剪，以下是链接： https://github.com/appleyuchi/Decision_Tree_Prune

Answer 5

在Scikit学习库中，您拥有名为ccp_alpha的参数作为DescissionTreeClassifier的参数。使用此功能，您可以对DecessionTrees进行后共修剪。签出https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html

修剪决策树

更新

更新2

5 个答案:

您无法通过上述方法实现该算法。