Question

So basically, I am using a random forest (RF) for descriptive modeling, as follows:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import class_weight

# Recent scikit-learn versions require `classes` and `y` as keyword arguments
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
class_weights = dict(enumerate(class_weights))
class_weights

{0: 0.5561096747856852, 1: 4.955559597429368}

clf = RandomForestClassifier(class_weight=class_weights, random_state=0)

cross_val_score(clf, X, y, cv=10, scoring='f1').mean()

and plotting the variable importances as:

import matplotlib.pyplot as plt

def plot_importances(clf, features, n):
    importances = clf.feature_importances_
    indices = np.argsort(importances)[::-1]

    if n:
        indices = indices[:n]

    plt.figure(figsize=(10, 5))
    plt.title("Feature importances")
    plt.bar(range(len(indices)), importances[indices], align='center')
    plt.xticks(range(len(indices)), features[indices], rotation=90)
    plt.xlim([-1, len(indices)])
    plt.show()

    return features[indices]

imp = plot_importances(clf, X.columns, 30)

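I expected the variable importances to be the same across multiple runs. However, every time I rerun the notebook, the importances change.

I don't understand why this happens. Is it related to the cross_val_score method?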

Answer 1

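I cannot reproduce the problem. For me, the variable importances do stay the same across multiple runs when I generate some data with the following: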

import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)
X = pd.DataFrame(X)

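Changing the data to have uneven class weights, by selecting only the first 750 y / X data points, does not lead to differences in the importances either.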
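For reference, the check can be sketched roughly as follows. It fits the same forest twice with a fixed random_state on the first 750 rows and compares the resulting importances. The helper name fitted_importances is hypothetical, and using class_weight='balanced' in place of the explicit weight dictionary from the question is an assumption made to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic data as above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)
X = pd.DataFrame(X)

# Keep only the first 750 rows so the two classes are no longer balanced
X_sub, y_sub = X.iloc[:750], y[:750]

def fitted_importances(seed):
    # Hypothetical helper: build a fresh forest, fit it, return its importances
    clf = RandomForestClassifier(class_weight='balanced', random_state=seed)
    clf.fit(X_sub, y_sub)
    return clf.feature_importances_

first = fitted_importances(0)
second = fitted_importances(0)

# With the same seed and the same data, the two fits should agree
same_seed_equal = np.allclose(first, second)
print(same_seed_equal)
```

If this prints True on your machine but your notebook still gives different importances, the difference is more likely to come from the data or from how the model is fitted than from the forest itself.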

What data are you using?
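On the cross_val_score part of the question: as far as I know, cross_val_score fits clones of the estimator on each fold and leaves the object you pass in untouched, so it should not be what changes the importances. It also means the importances have to come from a separate fit call. A minimal sketch of that behavior, on small made-up data (the sample size, n_estimators=20 and cv=5 are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic data set, only for illustration
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring='f1')

# cross_val_score worked on clones, so `clf` itself has no fitted trees yet
fitted_after_cv = hasattr(clf, "estimators_")

# An explicit fit is what populates estimators_ and feature_importances_
clf.fit(X, y)
fitted_after_fit = hasattr(clf, "estimators_")

print(fitted_after_cv, fitted_after_fit)
```

So if the importances you plot come from a fit elsewhere in the notebook, it is worth checking that this fit also uses a fixed random_state and the same rows each time.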
