Question

My neural network outputs a table of predicted class probabilities for a multi-label classification problem:

print(probabilities)

|   |      1       |      3       | ... |     8354     |     8356     |     8357     |
|---|--------------|--------------|-----|--------------|--------------|--------------|
| 0 | 2.442745e-05 | 5.952136e-06 | ... | 4.254002e-06 | 1.894523e-05 | 1.033957e-05 |
| 1 | 7.685694e-05 | 3.252202e-06 | ... | 3.617730e-06 | 1.613792e-05 | 7.356643e-06 |
| 2 | 2.296657e-06 | 4.859554e-06 | ... | 9.934525e-06 | 9.244772e-06 | 1.377618e-05 |
| 3 | 5.163169e-04 | 1.044035e-04 | ... | 1.435158e-04 | 2.807420e-04 | 2.346930e-04 |
| 4 | 2.484626e-06 | 2.074290e-06 | ... | 9.958628e-06 | 6.002510e-06 | 8.434519e-06 |
| 5 | 1.297477e-03 | 2.211737e-04 | ... | 1.881772e-04 | 3.171079e-04 | 3.228884e-04 |

I convert it to class labels with a threshold (0.2) so that I can measure the accuracy of my predictions:

predictions = (probabilities > 0.2).astype(int)
print(predictions)

|   | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... |    0 |    0 |    0 |
| 1 | 0 | 0 | ... |    0 |    0 |    0 |
| 2 | 0 | 0 | ... |    0 |    0 |    0 |
| 3 | 0 | 0 | ... |    0 |    0 |    0 |
| 4 | 0 | 0 | ... |    0 |    0 |    0 |
| 5 | 0 | 0 | ... |    0 |    0 |    0 |

I also have a test set:

print(Y_test)

|   | 1 | 3 | ... | 8354 | 8356 | 8357 |
|---|---|---|-----|------|------|------|
| 0 | 0 | 0 | ... |    0 |    0 |    0 |
| 1 | 0 | 0 | ... |    0 |    0 |    0 |
| 2 | 0 | 0 | ... |    0 |    0 |    0 |
| 3 | 0 | 0 | ... |    0 |    0 |    0 |
| 4 | 0 | 0 | ... |    0 |    0 |    0 |
| 5 | 0 | 0 | ... |    0 |    0 |    0 |

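Question: how can I build an algorithm in Python that picks the optimal threshold, i.e. the one that maximizes roc_auc_score(average = 'micro') or some other metric?

Maybe it is possible to write a function by hand in Python that optimizes the threshold for a given accuracy metric.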

Answer 1

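The best way to do this is to put a logistic regression on top of your new dataset. It multiplies each probability by a constant and thereby provides an automatic threshold on the output (with the LR you only need to predict the class, not the probabilities).

You need to train it by splitting the test set in two and using one part to train the LR after predicting the output with the NN.

This is not the only way to do it, but it has worked well for me every time.

We have X_train_nn, X_valid_nn and X_test_NN, and we subdivide X_test_NN into X_train_LR and X_test_LR (or do a stratified K-fold if you prefer). Here is a code sample: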

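from sklearn import linear_model

# NN probabilities on the two halves of the held-out set become the LR features
X_train = NN.predict_proba(X_train_LR)
X_test = NN.predict_proba(X_test_LR)
logistic = linear_model.LogisticRegression(C=1.0, penalty='l2')
logistic.fit(X_train, Y_train)  # Y_train: labels for X_train_LR
logistic.score(X_test, Y_test)  # Y_test: labels for X_test_LR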

You treat the NN output as a new dataset and train the LR on that new dataset.
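To see why this acts as an automatic threshold, here is a minimal sketch for a single label column. The synthetic scores stand in for NN.predict_proba, and the variable names and the weak regularization (C=1e6) are assumptions for illustration, not part of the answer above. A one-feature logistic regression predicts class 1 when w*p + b > 0, i.e. when p > -b/w, so the fitted coefficients imply a cut-off on the original probability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for one column of NN probabilities (assumption):
# positives score higher than negatives, but far below 0.5.
y = rng.integers(0, 2, size=400)
proba = np.where(y == 1,
                 rng.uniform(0.05, 0.30, size=400),
                 rng.uniform(0.00, 0.10, size=400))
X = proba.reshape(-1, 1)

# Weak regularization so the coefficients are not shrunk on this tiny feature scale
lr = LogisticRegression(C=1e6, max_iter=1000)
lr.fit(X, y)

w, b = lr.coef_[0, 0], lr.intercept_[0]
implied_threshold = -b / w  # LR predicts 1 exactly when proba > -b / w (for w > 0)
print(implied_threshold)
```

For the multi-label case you would presumably repeat this per label column, which gives one implied threshold per label.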

Answer 2

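I assume your ground-truth labels are Y_test and your predictions are predictions.

Optimizing roc_auc_score(average = 'micro') with respect to a prediction threshold does not seem to make sense: AUC is computed from how the predictions are ranked, so it needs predictions as float values in [0,1].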
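A small check of that ranking argument, on synthetic data (the arrays and the 0.2 cut-off below are illustrative assumptions): any strictly increasing transform of the scores leaves the AUC unchanged, while thresholding collapses the ranking to two levels and gives a different number.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic multi-label ground truth and noisy scores (assumption)
y_true = rng.integers(0, 2, size=(500, 4))
proba = np.clip(0.1 * y_true + rng.normal(0.1, 0.08, size=y_true.shape), 1e-6, 1.0)

auc_raw = roc_auc_score(y_true, proba, average="micro")
# Square root is strictly increasing, so the ranking is the same
auc_rescaled = roc_auc_score(y_true, np.sqrt(proba), average="micro")
# Hard 0/1 labels throw away the ranking within each group
auc_hard = roc_auc_score(y_true, (proba > 0.2).astype(int), average="micro")

print(auc_raw, auc_rescaled, auc_hard)
```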

Therefore, I will discuss accuracy_score.

You could use scipy.optimize.fmin to search for the threshold that maximizes accuracy_score.
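A minimal sketch of such a search, on synthetic data (the helper name neg_accuracy, the data and the coarse starting grid are assumptions for illustration). fmin minimizes, so the objective returns the negative accuracy. Accuracy is a step function of the threshold, so fmin can stall on a flat region; seeding it from a coarse grid is one way to reduce that risk.

```python
import numpy as np
from scipy.optimize import fmin
from sklearn.metrics import accuracy_score

def neg_accuracy(thr, y_true, proba):
    # fmin passes a 1-element array; accept plain floats as well
    thr = float(np.ravel(thr)[0])
    return -accuracy_score(y_true, (proba > thr).astype(int))

rng = np.random.default_rng(0)

# Synthetic multi-label data (assumption): positives around 0.25, negatives around 0.10
y_true = rng.integers(0, 2, size=(200, 5))
proba = np.clip(0.15 * y_true + rng.normal(0.1, 0.05, size=y_true.shape), 0.0, 1.0)

# Coarse grid to pick a sensible starting point, then local refinement with fmin
grid = np.linspace(0.05, 0.95, 19)
x0 = min(grid, key=lambda t: neg_accuracy(t, y_true, proba))
best_thr = fmin(neg_accuracy, x0=x0, args=(y_true, proba), disp=False)[0]

acc_default = -neg_accuracy(0.5, y_true, proba)
acc_best = -neg_accuracy(best_thr, y_true, proba)
print(best_thr, acc_default, acc_best)
```

The threshold should be tuned on held-out data that is not used for the final evaluation, otherwise the reported accuracy is optimistic.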

Answer 3

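Based on @cangrejo's answer (https://stats.stackexchange.com/a/310956/194535): suppose the original output probability of your model is the vector v. You can then define a prior distribution:

π = (1/θ1, 1/θ2, ..., 1/θN), with θi ∈ (0,1) and Σθi = 1, where N is the total number of label classes and i is the class index.

Take v' = v ⊙ π as the new output probability of your model, where ⊙ denotes the element-wise product.

Now your problem can be reformulated as: find the π that optimizes the metric you specified (e.g. roc_auc_score) on the new output probabilities. Once you have found it, the θs (θ1, θ2, ..., θN) are your optimal thresholds for each class.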
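As a quick numeric illustration of the rescaling (the numbers are made up for this example): dividing by θ favours classes with a small threshold, so the predicted class can change even though the raw probabilities do not.

```python
import numpy as np

v = np.array([[0.6, 0.3, 0.1]])     # original probabilities for one sample
theta = np.array([0.7, 0.2, 0.1])   # hypothetical per-class thresholds, summing to 1

v_new = v / theta                                  # v ⊙ π with π = 1/θ
v_new = v_new / v_new.sum(axis=1, keepdims=True)   # renormalize each row to sum to 1

raw_class = np.argmax(v, axis=1)       # class 0 wins on the raw probabilities
new_class = np.argmax(v_new, axis=1)   # class 1 wins after rescaling (0.3/0.2 is the largest ratio)
print(raw_class, new_class)
```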

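The code part:

Create a proxyModel class that takes your original model object as an argument and returns a proxyModel object. When you call predict_proba() on the proxyModel object, it automatically computes the new probabilities from the thresholds you specify: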

import numpy as np

class proxyModel():
    def __init__(self, origin_model):
        self.origin_model = origin_model

    def predict_proba(self, x, threshold_list=None):
        # get origin probability
        ori_proba = self.origin_model.predict_proba(x)

        # set default threshold
        if threshold_list is None:
            threshold_list = np.full(ori_proba[0].shape, 1)

        # get the output shape of threshold_list
        output_shape = np.array(threshold_list).shape

        # element-wise divide by the threshold of each classes
        new_proba = np.divide(ori_proba, threshold_list)

        # calculate the norm (sum of new probability of each classes)
        norm = np.linalg.norm(new_proba, ord=1, axis=1)

        # reshape the norm
        norm = np.broadcast_to(np.array([norm]).T, (norm.shape[0],output_shape[0]))

        # renormalize the new probability
        new_proba = np.divide(new_proba, norm)

        return new_proba

    def predict(self, x, threshold_list=None):
        return np.argmax(self.predict_proba(x, threshold_list), axis=1)

Implement a scoring function:

def scoreFunc(model, X, y_true, threshold_list):
    y_pred = model.predict(X, threshold_list=threshold_list)
    y_pred_proba = model.predict_proba(X, threshold_list=threshold_list)

    ###### metrics ######
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import f1_score

    accuracy = accuracy_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_pred_proba, average='macro')
    pr_auc = average_precision_score(y_true, y_pred_proba, average='macro')
    f1_value = f1_score(y_true, y_pred, average='macro')

    return accuracy, roc_auc, pr_auc, f1_value

Define the weighted_score_with_threshold() function, which takes the thresholds as input and returns the weighted score:

def weighted_score_with_threshold(threshold, model, X_test, Y_test, metrics='accuracy', delta=5e-5):
    # if the sum of the thresholds is not between 1-delta and 1+delta,
    # return infinity (this only reduces the search space of the minimization
    # algorithm, because the sum of the thresholds should be as close to 1 as possible).
    threshold_sum = np.sum(threshold)

    if threshold_sum > 1+delta:
        return np.inf

    if threshold_sum < 1-delta:
        return np.inf

    # to avoid objective function jump into nan solution
    if np.isnan(threshold_sum):
        print("threshold_sum is nan")
        return np.inf

    # renormalize: the sum of threshold should be 1
    normalized_threshold = threshold/threshold_sum

    # calculate scores based on thresholds
    # suppose it'll return 4 scores in a tuple: (accuracy, roc_auc, pr_auc, f1)
    scores = scoreFunc(model, X_test, Y_test, threshold_list=normalized_threshold)    

    scores = np.array(scores)
    weight = np.array([1,1,1,1])

    # Give the metric you want to maximize a bigger weight:
    if metrics == 'accuracy':
        weight = np.array([10,1,1,1])
    elif metrics == 'roc_auc':
        weight = np.array([1,10,1,1])
    elif metrics == 'pr_auc':
        weight = np.array([1,1,10,1])
    elif metrics == 'f1':
        weight = np.array([1,1,1,10])
    elif metrics == 'all':
        weight = np.array([1,1,1,1])

    # return the negative weighted sum (maximizing the sum
    # is equivalent to minimizing the negative sum)
    return -np.dot(weight, scores)

Use the optimization algorithm differential_evolution() (it works better than fmin) to find the optimal thresholds:

from scipy import optimize

output_class_num = Y_test.shape[1]
bounds = optimize.Bounds([1e-5]*output_class_num,[1]*output_class_num)

pmodel = proxyModel(model)

result = optimize.differential_evolution(weighted_score_with_threshold, bounds, args=(pmodel, X_test, Y_test, 'accuracy'))

# calculate threshold
threshold = result.x/np.sum(result.x)

# print the optimized score
print(scoreFunc(pmodel, X_test, Y_test, threshold_list=threshold))
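One practical concern with the objective above: it returns infinity unless the candidate thresholds already sum to 1 within delta, and points sampled from a box rarely do. A possible variation, sketched here on synthetic data, normalizes every candidate instead of rejecting it. The data, the neg_accuracy helper and the optimizer settings (x0, seed, maxiter, polish) are assumptions for illustration, and passing x0 requires a reasonably recent SciPy.

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 3

# Synthetic single-label data (assumption): the true class gets a boost,
# but class 0 is systematically over-predicted.
y_true = rng.integers(0, n_classes, size=n_samples)
logits = rng.normal(size=(n_samples, n_classes))
logits[np.arange(n_samples), y_true] += 1.5
logits[:, 0] += 1.0
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def neg_accuracy(raw, proba, y_true):
    theta = raw / raw.sum()                      # normalize instead of returning inf
    y_pred = np.argmax(proba / theta, axis=1)    # row renormalization does not change argmax
    return -np.mean(y_pred == y_true)

x0 = np.full(n_classes, 0.5)                     # equal thresholds as the starting guess
result = differential_evolution(
    neg_accuracy,
    bounds=[(1e-3, 1.0)] * n_classes,
    args=(proba, y_true),
    x0=x0, seed=0, maxiter=30, polish=False,
)

theta = result.x / result.x.sum()
acc_uniform = -neg_accuracy(x0, proba, y_true)
acc_opt = -result.fun
print(theta, acc_uniform, acc_opt)
```

Because the starting guess is part of the initial population and a member is only replaced by a candidate that scores at least as well, the optimized accuracy should not fall below the equal-threshold baseline on the data used for the search.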

How to choose the optimal threshold for class probabilities?
