Why can't I print the best selected features from a random forest?

Time: 2018-03-17 16:42:14

Tags: python pandas scikit-learn random-forest

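I built a random forest classifier and set up feature selection with threshold = 0.15, but when I try to iterate over the fitted selector, it does not print the selected features.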

Code:

X = data.loc[:,'IFATHER':'VEREP']
y = data.loc[:,'Criminal']

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Print the name and gini importance of each feature
for feature in zip(X, clf.feature_importances_):
    print(feature)

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)

The following code does not work:

# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(X[feature_list_index])

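I can get the importance of every feature from the random forest classifier when no threshold is involved. I suspect get_support() is not the right method here.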

1 answer:

Answer 0 (score: 1)

Create a new X dataset that contains only the most important features:

X_selected_features = sfm.fit_transform(X_train, y_train)

To see the names of the selected features:

features = np.array(list_of_feature_names)
print(features[sfm.get_support()])
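
To illustrate the indexing step on its own, here is a minimal sketch that leaves scikit-learn out. The feature names and the mask are made up for illustration: the mask stands in for what sfm.get_support() returns, and the integer positions stand in for sfm.get_support(indices=True). Both forms pick the same entries from a NumPy array of names.

```python
import numpy as np

# Hypothetical feature names and a hand-written mask; these stand in for
# list_of_feature_names and the boolean array from sfm.get_support().
features = np.array(["age", "income", "region", "score"])
mask = np.array([False, True, False, True])

# Boolean-mask indexing keeps the entries where the mask is True.
by_mask = features[mask]

# Integer positions of the True entries, the same kind of array that
# get_support(indices=True) would report.
indices = np.flatnonzero(mask)
by_index = features[indices]

print(by_mask)
print(by_index)
```

This only works because features is a NumPy array; a plain Python list does not accept a boolean mask as an index.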

If X is a pandas DataFrame:

features = X.columns.values
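
Putting the pieces together, the following is a rough end-to-end sketch on synthetic data, not the asker's dataset. The column names f0 to f4 and the make_classification settings are assumptions chosen for illustration. It also shows why a loop over integer positions should go through X.columns: on a DataFrame, X[i] looks up a column labelled i rather than the column at position i, which is presumably why the loop in the question fails when the columns have string names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real data; the column names are made up.
X_arr, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3", "f4"])

# Fit the selector; it fits its own copy of the classifier internally.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
sfm = SelectFromModel(clf, threshold=0.15)
sfm.fit(X, y)

# Boolean mask over the columns, used to index the column names.
mask = sfm.get_support()
selected_names = X.columns[mask].tolist()
print(selected_names)

# Same result via integer positions: index X.columns, not X itself.
for i in sfm.get_support(indices=True):
    print(X.columns[i])

# Importances of the fitted internal estimator, and the reduced data.
importances = sfm.estimator_.feature_importances_
X_selected = sfm.transform(X)
```

With only five features the importances sum to 1, so at least one of them is 0.2 or higher and passes the 0.15 threshold. On a dataset with many columns a fixed threshold of 0.15 may select nothing, in which case a value such as threshold="mean" could be a more forgiving choice.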