Random Forest: finding relevant features

Time: 2017-02-28 11:15:19

Tags: python-2.7 machine-learning scikit-learn random-forest

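I am trying to train an RF model for classification in sklearn. With a given set of features, the test accuracy I get is very low. I assume that some of the features I chose are misleading the model, so I tried RFE, RFECV, etc. to find a relevant subset of features, but that did not improve the accuracy. I came up with a simple feature selection procedure, shown below: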

from sklearn.ensemble import RandomForestClassifier

ml_feats = [...]  # initial list of feature (column) names

while True:
    feats_to_del = []
    prev_score = 0
    # grow the feature prefix one feature at a time, up to the full list
    for feat_len in range(2, len(ml_feats) + 1):
        classifier = RandomForestClassifier(**init_params)
        classifier.fit(X[ml_feats[:feat_len]], Y)
        score = classifier.score(Xt[ml_feats[:feat_len]], Yt)
        if score < prev_score:
            # the feature just added (index feat_len - 1) caused the score to decrease
            print(ml_feats[feat_len - 1])
            feats_to_del.append(ml_feats[feat_len - 1])
        prev_score = score
    if len(feats_to_del) == 0:
        break
    # delete irrelevant features, keeping the original feature order
    ml_feats = [f for f in ml_feats if f not in feats_to_del]

print(ml_feats)  # print all relevant features

Will the above code help find the right set of features? Thanks.

1 answer:

Answer 0: (score: 0)

What you are doing is greedy feature selection. If you want to select features using RandomForestClassifier, you can do the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# xtrain : training data
# ytrain : training labels

clf = RandomForestClassifier()
# selection threshold: the mean of the feature importances from the random forest classifier
sfm = SelectFromModel(estimator=clf, threshold='mean')
sfm.fit(xtrain, ytrain)
selected_xtrain = sfm.transform(xtrain)
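
To check whether the selected features actually help, the same selector can be applied to held-out data and a fresh forest scored on it. The following is only a minimal sketch under stated assumptions: the data is synthetic (generated with `make_classification`), and `xtest`, `ytest`, `mask`, `n_estimators=100` and `random_state=0` are illustrative choices that do not come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with 20 features, 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the selector on the training split only, so the test split stays unseen.
sfm = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    threshold='mean')
sfm.fit(xtrain, ytrain)

# Boolean mask over the original columns: True means the feature was kept.
mask = sfm.get_support()

# Apply the same column selection to both splits.
selected_xtrain = sfm.transform(xtrain)
selected_xtest = sfm.transform(xtest)

# Train a fresh forest on the reduced feature set and score it on held-out data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(selected_xtrain, ytrain)
test_score = clf.score(selected_xtest, ytest)

print(mask.nonzero()[0])  # indices of the selected features
print(test_score)         # accuracy on the held-out split
```

Comparing `test_score` with the score of a forest trained on all columns shows whether the selection made a difference on your data. A single train/test split can be noisy, so cross-validation may give a more reliable comparison.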