How to choose the optimal threshold for class probabilities?

Time: 2018-08-30 09:18:36

Tags: python machine-learning scikit-learn neural-network

The output of my neural network is a table of predicted class probabilities for multi-label classification:

print(probabilities)

|   |      1       |      3       | ... |     8354     |     8356     |     8357     |
|---|--------------|--------------|-----|--------------|--------------|--------------|
| 0 | 2.442745e-05 | 5.952136e-06 | ... | 4.254002e-06 | 1.894523e-05 | 1.033957e-05 |
| 1 | 7.685694e-05 | 3.252202e-06 | ... | 3.617730e-06 | 1.613792e-05 | 7.356643e-06 |
| 2 | 2.296657e-06 | 4.859554e-06 | ... | 9.934525e-06 | 9.244772e-06 | 1.377618e-05 |
| 3 | 5.163169e-04 | 1.044035e-04 | ... | 1.435158e-04 | 2.807420e-04 | 2.346930e-04 |
| 4 | 2.484626e-06 | 2.074290e-06 | ... | 9.958628e-06 | 6.002510e-06 | 8.434519e-06 |
| 5 | 1.297477e-03 | 2.211737e-04 | ... | 1.881772e-04 | 3.171079e-04 | 3.228884e-04 |

I use a threshold (0.2) to convert it to class labels, so that I can measure the accuracy of the predictions:

predictions = (probabilities > 0.2).astype(int)
print(predictions)

|   | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... |    0 |    0 |    0 |
| 1 | 0 | 0 | ... |    0 |    0 |    0 |
| 2 | 0 | 0 | ... |    0 |    0 |    0 |
| 3 | 0 | 0 | ... |    0 |    0 |    0 |
| 4 | 0 | 0 | ... |    0 |    0 |    0 |
| 5 | 0 | 0 | ... |    0 |    0 |    0 |

I also have a test set:

print(Y_test)

|   | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... |    0 |    0 |    0 |
| 1 | 0 | 0 | ... |    0 |    0 |    0 |
| 2 | 0 | 0 | ... |    0 |    0 |    0 |
| 3 | 0 | 0 | ... |    0 |    0 |    0 |
| 4 | 0 | 0 | ... |    0 |    0 |    0 |
| 5 | 0 | 0 | ... |    0 |    0 |    0 |

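Question: How can I build an algorithm in Python that selects the optimal threshold, i.e. the one that maximizes roc_auc_score(average = 'micro') or another metric?

Perhaps it is possible to build a manual function in Python that optimizes the threshold depending on the accuracy metric.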

3 Answers:

Answer 0: (score: 1)

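The best way to do this is to put a logistic regression (LR) on top of your new dataset. It multiplies each probability by some constant and thus provides an automatic threshold on the output (with the LR you only need to predict the class, not the probabilities).

You need to train it by splitting the test set in two, and use one part to train the LR after predicting the output with the NN.

This is not the only way to do it, but it has worked for me every time.

We have X_train_nn, X_valid_nn, X_test_NN, and we subdivide X_test_NN into X_train_LR, X_test_LR (or do a stratified K-fold, as you wish). Here is a code sample: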

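from sklearn import linear_model

# NN is the trained network; Y_train / Y_test are assumed to be the labels
# that belong to the X_train_LR / X_test_LR splits
X_train = NN.predict_proba(X_train_LR)
X_test = NN.predict_proba(X_test_LR)
logistic = linear_model.LogisticRegression(C=1.0, penalty='l2')
logistic.fit(X_train, Y_train)
logistic.score(X_test, Y_test)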

You treat the NN output as a new dataset and train an LR on this new dataset.
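
As a self-contained illustration of this stacking idea, the sketch below replaces the network's output with synthetic, poorly calibrated probabilities. The arrays `proba` and `Y`, the sample counts and the score distribution are all assumptions made up for the example. It also wraps LogisticRegression in OneVsRestClassifier, because a plain LogisticRegression does not accept a multi-label indicator matrix as the target; that wrapper is not part of the answer above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for NN.predict_proba(X_test_NN): 4000 samples, 3 labels.
Y = rng.integers(0, 2, size=(4000, 3))
# Poorly calibrated scores: positives sit around 0.15, negatives around 0.05,
# so a fixed 0.2 cut-off labels almost everything as 0.
proba = np.clip(0.05 + 0.10 * Y + rng.normal(0, 0.03, size=Y.shape), 1e-6, 1.0)

# Split the scored data in two: one half trains the LR, the other evaluates it.
P_train, P_test, Y_train, Y_test = train_test_split(
    proba, Y, test_size=0.5, random_state=0
)

# One logistic regression per label, each fed all three probabilities.
stacker = OneVsRestClassifier(LogisticRegression(C=1.0))
stacker.fit(P_train, Y_train)

naive = (P_test > 0.2).astype(int)   # fixed threshold from the question
stacked = stacker.predict(P_test)    # the LR decides the cut-off itself

naive_acc = (naive == Y_test).mean()
stacked_acc = (stacked == Y_test).mean()
print(f"per-label accuracy, fixed 0.2 threshold: {naive_acc:.3f}")
print(f"per-label accuracy, stacked LR:          {stacked_acc:.3f}")
```

With real data, `proba` would come from the network and the split would follow the X_train_LR / X_test_LR scheme described above.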

Answer 1: (score: 1)

I assume your ground-truth labels are Y_test and your predictions are predictions.

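Optimizing roc_auc_score(average = 'micro') with respect to a prediction threshold does not seem to make sense: AUCs are computed from how the predictions are ranked, and therefore need predictions as float values in [0,1].

Therefore, I will discuss accuracy_score.

You can use scipy.optimize.fmin to search for the threshold that maximizes accuracy_score.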
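
A minimal sketch of what such a search could look like, on synthetic data. The helper name `thr_to_accuracy`, the fake `probabilities` and the coarse grid used to pick a starting point are assumptions of this example, not something the answer specifies. The grid is there because accuracy is piecewise constant in the threshold, so fmin (Nelder-Mead) can stall when it starts on a flat stretch such as 0.5.

```python
import numpy as np
from scipy.optimize import fmin
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for Y_test and the network's probabilities.
Y_test = rng.integers(0, 2, size=(500, 4))
probabilities = np.clip(
    0.05 + 0.10 * Y_test + rng.normal(0, 0.03, size=Y_test.shape), 0, 1
)


def thr_to_accuracy(thr, y_true, proba):
    """Negative per-label accuracy at a threshold (fmin minimizes)."""
    thr = np.ravel(thr)[0]  # fmin passes a length-1 array
    y_pred = (proba > thr).astype(int)
    # Flatten so every label counts once, instead of requiring exact row matches.
    return -accuracy_score(y_true.ravel(), y_pred.ravel())


# Coarse scan to find a sensible starting point (an addition of this sketch).
grid = np.linspace(0.01, 0.99, 99)
grid_scores = [thr_to_accuracy(t, Y_test, probabilities) for t in grid]
x0 = grid[int(np.argmin(grid_scores))]
grid_acc = -min(grid_scores)

# Local refinement with fmin, starting from the best grid point.
best_thr = fmin(thr_to_accuracy, x0=x0, args=(Y_test, probabilities), disp=False)[0]
best_acc = -thr_to_accuracy(best_thr, Y_test, probabilities)

print(f"threshold: {best_thr:.4f}, accuracy: {best_acc:.3f}")
```

Any other threshold-dependent metric (for example f1_score) can be swapped into `thr_to_accuracy` in the same way.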

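Answer 2: (score: 1)

Following @cangrejo's answer: https://stats.stackexchange.com/a/310956/194535, suppose the raw output probability of your model is the vector v. You can then define a prior distribution:

π = (1/θ1, 1/θ2, ..., 1/θN), for θi ∈ (0,1) and Σθi = 1, where N is the total number of label classes and i is the class index.

Take v' = v ⊙ π as the new output probability of your model, where ⊙ denotes the element-wise product.

Now your question can be reformulated as: find the π that optimizes the metric you specified (e.g. roc_auc_score) on the new output probabilities. Once you have found it, the θs (θ1, θ2, ..., θN) are the optimal thresholds for each class.

The code part: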
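
To see why the θs behave like thresholds, consider the two-class case: class 1 wins the argmax of v' exactly when v1/θ1 > (1 − v1)/(1 − θ1), which simplifies to v1 > θ1. The short check below illustrates this with made-up numbers (θ = (0.7, 0.3) and the two sample probabilities are assumptions of the example); the renormalization step is included only so that v' sums to 1 again.

```python
import numpy as np

theta = np.array([0.7, 0.3])   # per-class thresholds, summing to 1
pi = 1.0 / theta               # the "prior" π from the text

chosen = []
for v1 in (0.25, 0.35):        # one value below θ1 = 0.3, one above
    v = np.array([1.0 - v1, v1])       # raw model output for one sample
    v_new = v * pi                     # element-wise product v ⊙ π
    v_new = v_new / v_new.sum()        # renormalize so it sums to 1
    chosen.append(int(np.argmax(v_new)))
    print(f"v1 = {v1}: v' = {v_new.round(3)}, predicted class {chosen[-1]}")
```

The sample with v1 = 0.25 stays in class 0, while the one with v1 = 0.35 is assigned to class 1, even though 0.35 is below the usual 0.5 cut-off.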


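  1. Create a proxyModel class that takes your original model object as an argument and returns a proxyModel object. When you call predict_proba() through the proxyModel object, it automatically computes new probabilities based on the thresholds you specify: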

    import numpy as np

    class proxyModel():
        def __init__(self, origin_model):
            self.origin_model = origin_model

        def predict_proba(self, x, threshold_list=None):
            # get the original probabilities
            ori_proba = self.origin_model.predict_proba(x)

            # default: a threshold of 1 for every class leaves the probabilities unchanged
            if threshold_list is None:
                threshold_list = np.full(ori_proba[0].shape, 1)

            # get the output shape of threshold_list
            output_shape = np.array(threshold_list).shape

            # element-wise divide by the threshold of each class
            new_proba = np.divide(ori_proba, threshold_list)

            # calculate the L1 norm (sum of the new probabilities over the classes)
            norm = np.linalg.norm(new_proba, ord=1, axis=1)

            # reshape the norm so it matches the shape of new_proba
            norm = np.broadcast_to(np.array([norm]).T, (norm.shape[0], output_shape[0]))

            # renormalize the new probabilities so each row sums to 1
            new_proba = np.divide(new_proba, norm)

            return new_proba

        def predict(self, x, threshold_list=None):
            return np.argmax(self.predict_proba(x, threshold_list), axis=1)
    
  2. Implement the scoring function:

    def scoreFunc(model, X, y_true, threshold_list):
        y_pred = model.predict(X, threshold_list=threshold_list)
        y_pred_proba = model.predict_proba(X, threshold_list=threshold_list)
    
        ###### metrics ######
        from sklearn.metrics import accuracy_score
        from sklearn.metrics import roc_auc_score
        from sklearn.metrics import average_precision_score
        from sklearn.metrics import f1_score
    
        accuracy = accuracy_score(y_true, y_pred)
        roc_auc = roc_auc_score(y_true, y_pred_proba, average='macro')
        pr_auc = average_precision_score(y_true, y_pred_proba, average='macro')
        f1_value = f1_score(y_true, y_pred, average='macro')
    
        return accuracy, roc_auc, pr_auc, f1_value
    
    
  3. Define the weighted_score_with_threshold() function, which takes the thresholds as input and returns a weighted score:

    def weighted_score_with_threshold(threshold, model, X_test, Y_test, metrics='accuracy', delta=5e-5):
        # if the sum of thresholds is not between 1-delta and 1+delta,
        # return infinity (this just reduces the search space of the minimization algorithm,
        # because the sum of thresholds should be as close to 1 as possible).
        threshold_sum = np.sum(threshold)
    
        if threshold_sum > 1+delta:
            return np.inf
    
        if threshold_sum < 1-delta:
            return np.inf
    
        # to avoid objective function jump into nan solution
        if np.isnan(threshold_sum):
            print("threshold_sum is nan")
            return np.inf
    
        # renormalize: the sum of threshold should be 1
        normalized_threshold = threshold/threshold_sum
    
        # calculate scores based on thresholds
        # suppose it'll return 4 scores in a tuple: (accuracy, roc_auc, pr_auc, f1)
        scores = scoreFunc(model, X_test, Y_test, threshold_list=normalized_threshold)    
    
        scores = np.array(scores)
        weight = np.array([1,1,1,1])
    
        # Give the metric you want to maximize a bigger weight:
        if metrics == 'accuracy':
            weight = np.array([10,1,1,1])
        elif metrics == 'roc_auc':
            weight = np.array([1,10,1,1])
        elif metrics == 'pr_auc':
            weight = np.array([1,1,10,1])
        elif metrics == 'f1':
            weight = np.array([1,1,1,10])
        elif metrics == 'all':
            weight = np.array([1,1,1,1])

        # return the negative weighted sum (maximizing the sum
        # is equivalent to minimizing the negative sum)
        return -np.dot(weight, scores)
    
  4. Use the optimization algorithm differential_evolution() (which works better than fmin here) to find the optimal thresholds:

    from scipy import optimize
    
    output_class_num = Y_test.shape[1]
    bounds = optimize.Bounds([1e-5]*output_class_num,[1]*output_class_num)
    
    pmodel = proxyModel(model)
    
    result = optimize.differential_evolution(weighted_score_with_threshold, bounds, args=(pmodel, X_test, Y_test, 'accuracy'))
    
    # calculate threshold
    threshold = result.x/np.sum(result.x)
    
    # print the optimized score
    # (use the proxy model: the original model does not accept threshold_list)
    print(scoreFunc(pmodel, X_test, Y_test, threshold_list=threshold))
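
The four steps above can be tried end to end without a trained model. The following is a condensed sketch under several assumptions: the "model output" is synthetic and deliberately biased towards class 0, the helper names `rescale` and `neg_accuracy` are hypothetical, only accuracy is optimized, and the thresholds are normalized inside the objective instead of rejecting candidates whose sum is not close to 1. It is meant to show the mechanics, not to reproduce the answer's exact code.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_samples, n_classes = 600, 3

# Hypothetical biased probabilities: the true class gets a boost of 2.0,
# but class 0 is systematically favoured by an extra 1.5.
y_true = rng.integers(0, n_classes, size=n_samples)
logits = rng.normal(0, 1, size=(n_samples, n_classes))
logits[np.arange(n_samples), y_true] += 2.0
logits[:, 0] += 1.5
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)


def rescale(proba, thresholds):
    """Divide by the per-class thresholds, then renormalize each row to sum to 1."""
    scaled = proba / thresholds
    return scaled / scaled.sum(axis=1, keepdims=True)


def neg_accuracy(thresholds, proba, y_true):
    """Objective for the optimizer: negative accuracy after rescaling."""
    thresholds = thresholds / thresholds.sum()  # keep the thresholds summing to 1
    y_pred = np.argmax(rescale(proba, thresholds), axis=1)
    return -accuracy_score(y_true, y_pred)


bounds = [(1e-3, 1.0)] * n_classes
result = differential_evolution(
    neg_accuracy, bounds, args=(proba, y_true), seed=0, maxiter=50, tol=1e-6
)

best_thresholds = result.x / result.x.sum()
baseline_acc = accuracy_score(y_true, np.argmax(proba, axis=1))
tuned_acc = -result.fun

print("thresholds:", best_thresholds.round(3))
print(f"accuracy before: {baseline_acc:.3f}, after: {tuned_acc:.3f}")
```

Because the thresholds here are tuned and scored on the same samples, the reported gain is optimistic; tuning on a held-out validation split and scoring on a separate test split gives a fairer estimate.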