Low accuracy with LabelPropagation and TfidfVectorizer

Asked: 2018-09-25 18:01:37

Tags: python scikit-learn tfidfvectorizer

I asked this question before, but it was closed, so I am asking again. I am a master's student in computer engineering trying to learn LabelPropagation, and my question is about it.

The code below gets a very low score, and I don't understand where the problem is. I am trying to use LabelPropagation together with TfidfVectorizer.

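The problem is low accuracy: the result is about 28%, even though there are only four categories. I expected a much higher accuracy. Is that expectation reasonable?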

Can anyone help me?

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

ratiolabeled = 0.5

categories = [
     'alt.atheism',
     'talk.religion.misc',
     'comp.graphics',
     'sci.space'
]


data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'),
                                categories=categories)
data_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.8, stop_words='english')

y_train, y_test = data_train.target, data_test.target
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

# Choose which training samples keep their labels (stratified by class).
labeled_indices, unlabeled_indices = train_test_split(
    np.arange(len(y_train)), test_size=1 - ratiolabeled,
    random_state=43, stratify=y_train)

# Mark the remaining samples as unlabeled (-1). This overwrites y_train in place.
y_train[unlabeled_indices] = -1
lp_model = LabelPropagation(kernel='knn', n_neighbors=21, n_jobs=-1, max_iter=20)
lp_model.fit(X_train.toarray(), y_train)

# Note: y_train still contains -1 for the unlabeled samples at this point.
print("Accuracy = ", lp_model.score(X_train.toarray(), y_train))

0 answers:

No answers yet