ValueError: Found arrays with inconsistent numbers of samples: [1 25]

Time: 2016-06-12 09:44:44

Tags: text scikit-learn cross-validation

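I am testing a classifier on my articles and want to evaluate it with cross-validation, but when I run my function I get an error that I don't quite understand. Here is my code: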

import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
from sklearn.cross_validation import cross_val_score, KFold
from scipy.stats import sem

bunch = load_files('corpus')

X = bunch.data
y = bunch.target

count_vect = CountVectorizer(stop_words = 'english')
X_counts = count_vect.fit_transform(X)

tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

classifier = MultinomialNB().fit(X_tfidf, y)

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross validation iterator of k=N folds
    cv = KFold(len(y), K, shuffle=False, random_state=0)
    # by default, the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print scores
    print ("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores))

clfs = [classifier]

for clf in clfs:
     evaluate_cross_validation(clf, X, y, 26)

However, I get this error and I don't understand what is going on:

Traceback (most recent call last):

  File "<ipython-input-289-8f2a0d6aa294>", line 4, in <module>
    evaluate_cross_validation(clf, X, y, 26)

  File "<ipython-input-287-ecb52eb2fc76>", line 5, in evaluate_cross_validation
    scores = cross_val_score(clf, X, y, cv=cv)

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1433, in cross_val_score
    for train, test in cv)

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
    self.results = batch()

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 527, in fit
    X, y = check_X_y(X, y, 'csr')

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 520, in check_X_y
    check_consistent_length(X, y)

  File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))

ValueError: Found arrays with inconsistent numbers of samples: [ 1 25]

Any help would be appreciated.

Thanks

EDITED

I can get a value for y.shape, but X.shape just raises an error.

X.shape
Traceback (most recent call last):

  File "<ipython-input-305-270dd209b8a9>", line 1, in <module>
    X.shape

AttributeError: 'list' object has no attribute 'shape'

y.shape
Out[306]: (26,)
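
A possible reading of the `[ 1 25]` pair, offered as a sketch rather than a confirmed diagnosis: X here is still the plain list of raw article strings, and with K=26 each training fold holds 25 of the 26 items. If the estimator's input validation promotes a 1-D array of 25 strings to a 2-D array, it would look like 1 sample with 25 features, while y_train still has 25 entries. The snippet below only imitates that reshaping with NumPy; the names `docs`, `labels`, `train_docs` and `train_labels` are hypothetical stand-ins, not the real corpus.

```python
import numpy as np

# Hypothetical stand-ins for bunch.data (a list of strings) and bunch.target.
docs = ["article text %d" % i for i in range(26)]
labels = np.arange(26) % 2

# One fold of a 26-fold split: hold out one article, train on the other 25.
train_docs = docs[1:]
train_labels = labels[1:]

# Assumption: a 1-D array of strings gets promoted to 2-D during validation.
as_2d = np.atleast_2d(np.array(train_docs))

print(as_2d.shape)         # (1, 25): one "sample" with 25 "features"
print(train_labels.shape)  # (25,): 25 labels
```

Under that assumption, the sample counts 1 and 25 disagree, which would match the numbers reported in the ValueError.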

EDITED II:

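Yes, I am working with 26 articles. What I am trying to do with my code is convert the articles to a tf-idf representation and then use that with my classifier. However, there seems to be a disconnect between my classifier and my evaluate_cross_validation(clf, X, y, K) function. I am attempting leave-one-out CV. My X_tfidf.shape is as follows: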

X_tfidf.shape
Out[17]: (26, 3777)
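
Since X_tfidf has one row per article, one thing that might be worth trying is handing that matrix, rather than the raw list X, to the cross-validation call. The sketch below assumes a recent scikit-learn, where `cross_val_score` and `LeaveOneOut` live in `sklearn.model_selection` instead of `sklearn.cross_validation`, and it uses a hypothetical six-document toy corpus in place of the real `corpus` directory.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for bunch.data / bunch.target.
docs = [
    "football match goal striker",
    "goal keeper football league",
    "league match striker penalty",
    "election vote parliament minister",
    "minister parliament budget vote",
    "budget election campaign vote",
]
labels = np.array([0, 0, 0, 1, 1, 1])

# Same two-step vectorization as in the question.
X_counts = CountVectorizer(stop_words="english").fit_transform(docs)
X_tfidf = TfidfTransformer().fit_transform(X_counts)

# Pass the tf-idf matrix (one row per document), not the list of strings.
scores = cross_val_score(MultinomialNB(), X_tfidf, labels, cv=LeaveOneOut())

print(X_tfidf.shape[0], len(labels))  # row count matches label count
print(scores)                          # one accuracy value per held-out document
```

Note that fitting the vectorizer on all documents before splitting lets every fold see vocabulary and IDF statistics from its held-out article, so the scores may be slightly optimistic.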

EDITED III:

I would like to use this pipeline approach, but it looks to me as though it does not tf-idf vectorize the articles:

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

I can't figure out how to modify this pipeline classifier so that it performs the tf-idf vectorization. :S
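
For what it is worth, a pipeline built this way should already apply both the count and tf-idf steps whenever `fit` or `predict` is called, so it can be given the raw strings directly. The sketch below assumes a recent scikit-learn (`sklearn.model_selection`) and a hypothetical six-document toy corpus; it runs leave-one-out CV on the raw text and then inspects the fitted `tfidf` step to check that the transformation happened.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for bunch.data / bunch.target.
docs = [
    "football match goal striker",
    "goal keeper football league",
    "league match striker penalty",
    "election vote parliament minister",
    "minister parliament budget vote",
    "budget election campaign vote",
]
labels = np.array([0, 0, 0, 1, 1, 1])

text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Raw strings go in; each fold refits vect and tfidf on its training documents.
scores = cross_val_score(text_clf, docs, labels, cv=LeaveOneOut())
print(scores)

# Fit once on everything and look at what the intermediate steps produce.
text_clf.fit(docs, labels)
counts = text_clf.named_steps["vect"].transform(docs)
tfidf = text_clf.named_steps["tfidf"].transform(counts)
print(tfidf.shape)                            # (n_documents, vocabulary_size)
print(text_clf.named_steps["tfidf"].idf_[:5])  # learned IDF weights
```

Because the vectorizer sits inside the pipeline, each fold learns its vocabulary and IDF weights from the training articles only.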

0 answers:

No answers