Mapping column names to random forest feature importances

Time: 2017-01-27 18:09:10

Tags: python pandas

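I am trying to plot feature importances for a random forest model and map each importance back to its original variable. I managed to create a plot that shows the importances with the original variable names as labels, but the names currently appear in the order they have in the dataset rather than in order of importance. How can I sort them by feature importance? Thanks!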


My code is:

importances = brf.feature_importances_
std = np.std([tree.feature_importances_ for tree in brf.estimators_],
         axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(x_dummies.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(8,8))
plt.title("Feature importances")
plt.bar(range(x_train.shape[1]), importances[indices],
   color="r", yerr=std[indices], align="center")
feature_names = x_dummies.columns
plt.xticks(range(x_dummies.shape[1]), feature_names)
plt.xticks(rotation=90)
plt.xlim([-1, x_dummies.shape[1]])
plt.show()

4 answers:

Answer 0 (score: 15)

A general solution is to put the features and their importances into a DataFrame and sort them before plotting:

import pandas as pd
%matplotlib inline
#do code to support model
#"data" is the X dataframe and model is the SKlearn object

feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(data.columns, model.feature_importances_):
    feats[feature] = importance #add the name/value pair 

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=45)
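Applied to the code in the question, the same idea seems to come down to reordering the tick labels with the same `indices` array that already reorders the bars. The following is a minimal sketch with made-up values; the arrays are hypothetical stand-ins for `brf.feature_importances_` and `x_dummies.columns`, not data from the question.

```python
import numpy as np

# Hypothetical stand-ins for brf.feature_importances_ and x_dummies.columns
importances = np.array([0.10, 0.55, 0.05, 0.30])
feature_names = np.array(["age", "income", "city", "tenure"])

# Positions that sort the importances from largest to smallest
indices = np.argsort(importances)[::-1]

sorted_importances = importances[indices]
sorted_names = feature_names[indices]  # reorder the labels with the same indices

# Print the ranking with names instead of column positions
for rank, (name, value) in enumerate(zip(sorted_names, sorted_importances), start=1):
    print("%d. %s (%f)" % (rank, name, value))
```

In the question's plot this would presumably mean passing `feature_names[indices]` to `plt.xticks` instead of `feature_names`, so that bars and labels share one ordering.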

Answer 1 (score: 5)

I use a solution similar to Sam's:

import pandas as pd
important_features = pd.Series(data=brf.feature_importances_,index=x_dummies.columns)
important_features.sort_values(ascending=False,inplace=True)

I usually just print the list with print(important_features), but for plotting you can always use Series.plot.
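If the error bars from the question are wanted as well, one option is to keep the per-tree standard deviation in a second column so that both are sorted together. This is only a sketch under stated assumptions: `per_tree` is a hypothetical array standing in for `[tree.feature_importances_ for tree in brf.estimators_]`, its column mean stands in for `brf.feature_importances_`, and the column names are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical per-tree importances (3 trees x 4 features), standing in for
# [tree.feature_importances_ for tree in brf.estimators_]
per_tree = np.array([
    [0.10, 0.50, 0.10, 0.30],
    [0.20, 0.60, 0.00, 0.20],
    [0.00, 0.55, 0.05, 0.40],
])
columns = ["age", "income", "city", "tenure"]  # stands in for x_dummies.columns

# One row per feature; sorting the frame keeps importance and std aligned
ranked = pd.DataFrame(
    {"importance": per_tree.mean(axis=0), "std": per_tree.std(axis=0)},
    index=columns,
).sort_values(by="importance", ascending=False)

print(ranked)
```

With matplotlib available, `ranked["importance"].plot(kind="bar", yerr=ranked["std"])` should then draw the sorted bars with their error bars.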

Answer 2 (score: 2)

Another simple way to get a sorted list:

importances = list(zip(xgb_classifier.feature_importances_, df.columns))
importances.sort(reverse=True)

If needed, the following code adds a visualization:

pd.DataFrame(importances, index=[x for (_,x) in importances]).plot(kind = 'bar')
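One detail of this approach: tuples compare element by element, so when two importances are equal the sort falls through to comparing the column names. A possible variation, sketched here with invented values in place of `xgb_classifier.feature_importances_` and `df.columns`, is to sort on the score alone with an explicit key.

```python
# Hypothetical values standing in for xgb_classifier.feature_importances_
# and df.columns
scores = [0.10, 0.55, 0.05, 0.30]
names = ["age", "income", "city", "tenure"]

# Sort (score, name) pairs by score only, largest first
pairs = sorted(zip(scores, names), key=lambda pair: pair[0], reverse=True)

for score, name in pairs:
    print("%-8s %.2f" % (name, score))
```

Because Python's sort is stable, features with equal scores keep their original column order under this variant.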

Answer 3 (score: 1)

It is quite simple; this is how I plot it.

feat_importances = pd.Series(extraTree.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh')
plt.title("Top 15 important features")
plt.show()