Question

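For a machine learning class, I need to compute the Gini index for a decision tree with 2 classes (0 and 1 in this case). I have read several sources on how to compute it, but I can't get it to work in my own script. After trying about 10 different calculations, I'm getting desperate.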

The arrays are:

Y_left = np.array([[1.],[0.],[0.],[1.],[1.],[1.],[1.]])
Y_right = np.array([[1.],[0.],[0.],[0.],[1.],[0.],[0.],[1.],[0.]])

The output should be 0.42857.

Formula

where C is the set of class labels (2 of them here), and S_L and S_R are the two partitions determined by the split criterion.
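
Written out, a formula consistent with this description and with the expected output of 0.42857 would presumably be the size-weighted Gini impurity of the two partitions. The symbols p_{L,c} and p_{R,c} (the fraction of samples with label c in S_L and S_R) and the name G are notation assumed here for illustration:

```latex
G(S_L, S_R) = \frac{|S_L|}{|S|}\Bigl(1 - \sum_{c \in C} p_{L,c}^2\Bigr)
            + \frac{|S_R|}{|S|}\Bigl(1 - \sum_{c \in C} p_{R,c}^2\Bigr),
\qquad |S| = |S_L| + |S_R|
```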

What I have now is:

def tree_gini_index(Y_left, Y_right, classes):
    """Compute the Gini Index.
    # Arguments
        Y_left: class labels of the data left set
            np.array of size `(n_objects, 1)`
        Y_right: class labels of the data right set
            np.array of size `(n_objects, 1)`
        classes: list of all class values
    # Output
        gini: scalar `float`
    """
    gini = 0.0
    total = len(Y_left) + len(Y_right)
    gini = sum((sum(Y_left) / total)**2, (sum(Y_right) / total)**2)
    return gini

I would appreciate any guidance on how to define this function.

Answer 1

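This function computes the Gini index for a single label array, either left or right. probs simply stores the probability p_c of each class, as in your formula.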

import numpy as np

def gini(y, classes):

    y = y.reshape(-1, )                             # Just flattens the 2D array into 1D array for simpler calculations
    if not y.shape[0]:
        return 0
    
    probs = []
    for cls in classes:
        probs.append((y == cls).sum() / y.shape[0]) # For each class c in classes compute class probabilities
    
    p = np.array(probs)
    return 1 - ((p*p).sum())

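The next function then takes the average of the two values, weighted by sample count, to produce the final Gini index of the split. Note that p_L and p_R play the role of |S_n|/|S| in your formula, where n is one of {left, right}.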

def tree_gini_index(Y_left, Y_right, classes):
    
    N = Y_left.shape[0] + Y_right.shape[0]
    p_L = Y_left.shape[0] / N
    p_R = Y_right.shape[0] / N
    
    return p_L * gini(Y_left, classes) + p_R * gini(Y_right, classes)
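
As a variation on the two functions above, the per-class loop can be replaced by np.unique with return_counts=True. This is only a sketch: the names gini_from_labels and split_gini are hypothetical, and it assumes every label present in the arrays is a valid class, so the classes argument is dropped (a class that never appears contributes 0 to the sum of squares either way).

```python
import numpy as np

def gini_from_labels(y):
    """Gini impurity of one label array, from the counts of the labels present."""
    y = np.asarray(y).ravel()          # accept (n, 1) or (n,) arrays
    if y.size == 0:
        return 0.0                     # an empty partition is treated as pure
    _, counts = np.unique(y, return_counts=True)
    p = counts / y.size                # class probabilities p_c
    return 1.0 - np.sum(p ** 2)

def split_gini(y_left, y_right):
    """Sample-weighted average of the impurities of the two partitions."""
    n_left, n_right = np.size(y_left), np.size(y_right)
    n = n_left + n_right
    if n == 0:
        return 0.0
    return (n_left / n) * gini_from_labels(y_left) + (n_right / n) * gini_from_labels(y_right)
```

Called as split_gini(Y_left, Y_right) on the arrays from the question, this should agree with tree_gini_index(Y_left, Y_right, [0, 1]) up to floating-point rounding.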

Call it like this:

Y_left = np.array([[1.],[0.],[0.],[1.],[1.],[1.],[1.]])
Y_right = np.array([[1.],[0.],[0.],[0.],[1.],[0.],[0.],[1.],[0.]])
tree_gini_index(Y_left, Y_right, [0, 1])

Output:

0.4285714285714286
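
This value can be checked by hand. Y_left holds 7 labels (two 0s, five 1s) and Y_right holds 9 labels (six 0s, three 1s), so, writing G_L and G_R for the two per-partition impurities (names chosen here only for the worked example):

```latex
G_L = 1 - \Bigl(\tfrac{2}{7}\Bigr)^2 - \Bigl(\tfrac{5}{7}\Bigr)^2 = \tfrac{20}{49}, \qquad
G_R = 1 - \Bigl(\tfrac{6}{9}\Bigr)^2 - \Bigl(\tfrac{3}{9}\Bigr)^2 = \tfrac{4}{9}

\tfrac{7}{16} \cdot \tfrac{20}{49} + \tfrac{9}{16} \cdot \tfrac{4}{9}
  = \tfrac{5}{28} + \tfrac{1}{4} = \tfrac{3}{7} \approx 0.428571
```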
