Question

Classifiers in machine learning packages such as liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0

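My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but could not find anything like it.

If there is no such function yet, does anybody know a workaround to get these values?

Thanks a lot!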

Answer 1

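The classifiers themselves do not record feature names; they only see numeric arrays. However, if you extracted your features with a Vectorizer / CountVectorizer / TfidfVectorizer / DictVectorizer and you are using a linear model (e.g. LinearSVC or Naive Bayes), you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.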

Answer 2

With the help of larsmans' code, I came up with this code for the binary case:

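def show_most_informative_features(vectorizer, clf, n=20):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    # Sort (coefficient, feature) pairs from most negative to most positive
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Pair the n most negative coefficients with the n most positive ones
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))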

Answer 3

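To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much of the observed variance is explained by that feature. Obviously, the sum of all these values must be <= 1.

I find this attribute very useful when doing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

Edit: this works for both RandomForest and GradientBoosting. So RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support it.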
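A minimal sketch of reading the attribute, assuming the iris dataset bundled with scikit-learn as stand-in data (the dataset and the hyperparameters are my choice, not part of the answer). In scikit-learn these are impurity-based importances, normalized so that they sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(data.data, data.target)

# One non-negative importance value per input feature
importances = clf.feature_importances_

# Print features from most to least important
ranked = sorted(zip(importances, data.feature_names), reverse=True)
for importance, name in ranked:
    print("%-20s %.3f" % (name, importance))
```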

Answer 4

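We recently released a library (https://github.com/TeamHG-Memex/eli5) that allows this: it handles various classifiers from scikit-learn, binary / multiclass cases, allows highlighting text according to feature values, integrates with IPython, etc.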

Answer 5

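I actually had to find feature importance on my NaiveBayes classifier, and although I used the functions above, I was not able to get feature importance per class. I went through the scikit-learn documentation and tweaked the functions above a bit, and found that it works for my problem. Hope it helps you too!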

def important_features(vectorizer, classifier, n=20):
    class_labels = classifier.classes_
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    # feature_count_[k] holds the per-feature counts seen for class k during fitting
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]
    print("Important words in negative reviews")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)
    print("-----------------------------------------")
    print("Important words in positive reviews")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat)

Note that your classifier (in my case it's NaiveBayes) must have the attribute feature_count_ for this to work.

Answer 6

You can also do something like this to create a chart of the feature importances in order:

import numpy as np
import matplotlib.pyplot as plt

# Assumes clf is a fitted forest (e.g. RandomForestClassifier) and
# train[features] is the DataFrame of feature columns it was fitted on.
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]  # most important first
n_features = train[features].shape[1]

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(n_features), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(n_features), indices)
plt.xlim([-1, n_features])
plt.show()

Answer 7

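RandomForestClassifier does not yet have a coef_ attribute, but I think it will in the 0.17 release. However, see the RandomForestClassifierWithCoef class in "Recursive feature elimination on Random Forest using scikit-learn". It may give you some ideas for working around the limitation above.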
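In current scikit-learn releases, as far as I can tell, RFE falls back to feature_importances_ when the estimator has no coef_, so a wrapper class should not be needed any more. A minimal sketch, assuming the bundled iris dataset as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data = load_iris()
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# RFE repeatedly refits the forest and drops the least important feature
selector = RFE(forest, n_features_to_select=2).fit(data.data, data.target)

selected = [name for name, keep in zip(data.feature_names, selector.support_) if keep]
print(selected)
```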

Answer 8

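Not exactly what you are looking for, but a quick way to get the largest magnitude coefficients (assuming the columns of a pandas dataframe are your feature names):

You trained the model like this: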

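from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(df, Y, test_size=0.25)
lr.fit(X_train, y_train)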

Get the 10 largest negative coefficient values (or change to reverse=True for the largest positive ones) like this:

sorted(zip(X_train.columns, lr.coef_), key=lambda x: x[1], reverse=False)[:10]
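If you want the largest coefficients by absolute value regardless of sign, one option is to put them in a pandas Series and sort on abs(). The synthetic data below (columns a, b, c and the target formula) is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: the target depends on a and b, but not on c
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = 3.0 * X["a"] - 5.0 * X["b"]

lr = LinearRegression().fit(X, y)

# Keep the signed coefficients, but order them by magnitude
coefs = pd.Series(lr.coef_, index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked.head(10))
```

Keep in mind that coefficient magnitudes are only comparable when the features are on similar scales.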

Answer 9

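First create a list (I call it label) and fill it with all the feature names, i.e. the column names. Here I use a naive Bayes model, where feature_log_prob_ gives the log probability of each feature given a class.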

import numpy as np

def top20(model, label):
    # feature_log_prob_[i] holds log P(feature | class i); values closer to 0 mean
    # more probable, so sort on the raw values (abs() would rank the least probable first)
    for i, log_prob in enumerate(model.feature_log_prob_):
        print('top 20 features for {} class'.format(i))
        # Indices of the 20 most probable features, highest first
        for j in np.argsort(log_prob)[::-1][:20]:
            print(label[j])
        print('*' * 80)

How to get the most informative features for scikit-learn classifiers?

9 answers: