Inconsistent number of samples in CountVectorizer

Asked: 2015-04-22 22:02:58

Tags: python machine-learning scikit-learn

I am trying to run multinomial Naive Bayes classification on a set of my tweets.

Here is my code:

import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Each line of the file becomes one document (one row of the count matrix)
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))  ## Error here

# One label per training document
tags = ['Pro_vax', 'Anti_vax', 'Neither']
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Vectorize the test file with the vocabulary learned above, then predict
testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
results = mnb.predict(testset)
print(results)

The file train.txt contains the following text:

Vaccines are a very good idea.  They prevent all sorts of deadly diseases.
Vaccines cause autism.  Do not vaccinate your children
Going to read about vaccines.  Then, I am going to see my brother with autism.

I labeled them using the tags variable.

The file test.txt contains the following text:

Do not get your kids vaccinated.  Vaccination and autism are correlated.

When I run the script, I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [3 9]

I am not familiar with this error. What does it mean, and how can I prevent it from happening again?
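
One plausible reading of the error, offered as a sketch rather than a confirmed diagnosis: scikit-learn is reporting that the feature matrix and the label list disagree on the number of samples, and the [3 9] looks like the two sizes (3 tags against 9 rows). CountVectorizer.fit_transform treats every item yielded by its input as one document, and iterating over an open file yields one item per line, blank lines included. So if train.txt really holds nine lines (blank separator lines or wrapped sentences would do it; this is an assumption, since the listing above shows three), the matrix ends up with nine rows. The snippet below uses a hypothetical in-memory stand-in for the file and two illustrative helpers, read_documents and check_aligned, neither of which is part of scikit-learn.

```python
import io


def read_documents(handle):
    """Return one stripped string per non-blank line of an open text file."""
    return [line.strip() for line in handle if line.strip()]


def check_aligned(documents, labels):
    """Fail early, with a readable message, if documents and labels differ in length."""
    if len(documents) != len(labels):
        raise ValueError(
            "%d documents but %d labels" % (len(documents), len(labels))
        )


# Hypothetical stand-in for train.txt: three tweets separated by blank lines.
sample = io.StringIO(
    u"Vaccines are a very good idea.\n"
    u"\n"
    u"Vaccines cause autism.\n"
    u"\n"
    u"Going to read about vaccines.\n"
)

raw_lines = list(sample)            # what iterating over the file yields
sample.seek(0)
documents = read_documents(sample)  # blank lines dropped

tags = ['Pro_vax', 'Anti_vax', 'Neither']

print(len(raw_lines))   # 5: blank lines count as items too
print(len(documents))   # 3: now matches len(tags)
check_aligned(documents, tags)
```

If this is the cause, passing the cleaned list to the vectorizer, as in word_vectorizer.fit_transform(documents), should give one row per tag, since fit_transform accepts any iterable of strings. Printing trainset.shape before calling mnb.fit is a quick way to confirm how many rows the vectorizer actually produced.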

0 answers:

No answers yet.