Question

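I am using sklearn's RandomForestClassifier on data with highly imbalanced classes: a large number of 0s and very few 1s. I am interested in the number of 1s in the predictions. Example (credit):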

# Load libraries
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Make class highly imbalanced by removing first 46 observations
X = X[46:,:]
y = y[46:]
# Create target vector: 1 if the original class is 0, otherwise 0
y = np.where((y == 0), 1, 0)
# Split into training and testing sets (alternating rows)
trainx = X[::2]
trainy = y[::2]
testx = X[1::2]
testy = y[1::2]
# Create random forest classifier object
clf = RandomForestClassifier()
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())

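This returns 2. That is fine, except that on my real data the result is somewhat lower than the true answer. I wanted to use the class_weight parameter to fix this. However, when I do this: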

clf = RandomForestClassifier(class_weight="balanced") 
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())

I get a result of 0. The result is the same with class_weight={1:10}. If I use class_weight={1:.1}, I get 2 again.
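Since the classifiers above are built without a fixed random_state, and the training split contains only two positive rows, it may be worth checking how stable these counts are before reading too much into a single run. The sketch below repeats the experiment over several seeds for each class_weight setting. The helper name count_predicted_ones, the seed range, and n_estimators=50 are illustrative assumptions, not part of the original code.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# Rebuild the same imbalanced data and split as in the question
iris = datasets.load_iris()
X = iris.data[46:, :]
y = np.where(iris.target[46:] == 0, 1, 0)
trainx, trainy = X[::2], y[::2]
testx, testy = X[1::2], y[1::2]


def count_predicted_ones(class_weight, seeds=range(10)):
    """Hypothetical helper: number of predicted 1s for each random seed."""
    counts = []
    for seed in seeds:
        clf = RandomForestClassifier(
            n_estimators=50,          # assumption: smaller forest to keep the loop fast
            class_weight=class_weight,
            random_state=seed,
        )
        clf.fit(trainx, trainy)
        counts.append(int(clf.predict(testx).sum()))
    return counts


# Same settings as tried in the question, plus the unweighted default
settings = (None, "balanced", {1: 10}, {1: 0.1})
results = {str(cw): count_predicted_ones(cw) for cw in settings}

for name, counts in results.items():
    print(name, counts)
```

If the counts move between 0 and 2 across seeds within one setting, the difference seen in a single run may be noise rather than an effect of class_weight.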

I see similar behavior on my real data: the higher the weight on class 1, the fewer 1s I get in the predictions.
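The sum of the predictions hides which rows are being missed. A rough way to look closer, assuming the same data split as above, is to print a confusion matrix and the predicted probability of class 1 for the rows that are truly 1. The fixed random_state=0 and the variable names pos_col and proba_pos are assumptions added here for reproducibility and readability.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Rebuild the same imbalanced data and split as in the question
iris = datasets.load_iris()
X = iris.data[46:, :]
y = np.where(iris.target[46:] == 0, 1, 0)
trainx, trainy = X[::2], y[::2]
testx, testy = X[1::2], y[1::2]

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(trainx, trainy)

# Find the predict_proba column that belongs to label 1
pos_col = list(clf.classes_).index(1)
proba_pos = clf.predict_proba(testx)[:, pos_col]

# Rows are true labels, columns are predicted labels, both ordered [0, 1]
cm = confusion_matrix(testy, clf.predict(testx), labels=[0, 1])
print(cm)
print("true 1s in test set:", int(testy.sum()))
print("P(class 1) for the true 1s:", proba_pos[testy == 1])
```

Probabilities for the true 1s that sit just under 0.5 would suggest the forest is close to predicting them, which a plain count of predicted 1s cannot show.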

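This is the opposite of the behavior I expect (and the opposite of how the class_weight parameter behaves in svm). What is going on here? This question suggests that sklearn assigns class labels through some kind of default, but that seems odd. Why would it not use the class labels I gave it?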
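One way to check which labels the weights are attached to is to look at clf.classes_ and to compute the "balanced" weights directly with sklearn.utils.class_weight.compute_class_weight. The sketch below assumes the same training split as above; random_state=0 is an added assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Rebuild the same imbalanced training split as in the question
iris = datasets.load_iris()
X = iris.data[46:, :]
y = np.where(iris.target[46:] == 0, 1, 0)
trainx, trainy = X[::2], y[::2]

# Labels as the fitted forest sees them, in sorted order
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(trainx, trainy)
print("classes_:", clf.classes_)

# "balanced" weights: n_samples / (n_classes * count_of_each_class)
classes = np.unique(trainy)
weights = compute_class_weight("balanced", classes=classes, y=trainy)
for label, weight in zip(classes, weights):
    print(f"label {label}: count={int((trainy == label).sum())}, weight={weight:.2f}")
```

With 50 rows of label 0 and 2 rows of label 1 in the training split, this gives a weight of 0.52 for label 0 and 13.00 for label 1, so in this computation the rare label 1 receives the larger weight.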

sklearn class_weight in random forest does the opposite of what I expect

0 answers: