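I recently entered a Kaggle competition and am trying to run the linear CV models from scikit-learn. I am aware of a similar question on Stack Overflow, but I can't see how the accepted answer relates to my problem. Any help would be greatly appreciated. My code is below: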
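import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

train = pd.read_csv(".../train.csv")
test = pd.read_csv(".../test.csv")
data = pd.read_csv(".../sampleSubmission.csv")

# Turn the tweet text into sparse TF-IDF features
transformer = TfidfVectorizer(max_features=None)
Y = transformer.fit_transform(train.tweet)
Z = transformer.transform(test.tweet)

clf = linear_model.RidgeCV()

# Fit one model per target column (train columns 4..27) and store the
# predictions in the matching submission column (starting at column 1).
# Note: .ix was removed in pandas 1.0; .iloc[:, a] is the positional equivalent.
a = 4
b = 1
while a < 28:
    clf.fit(Y, train.ix[:, a])
    pred = clf.predict(Z)
    linpred = pd.DataFrame(pred)
    data[data.columns[b]] = linpred
    b = b + 1
    a = a + 1
    print(b)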
The full error I receive is pasted below:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-41c31233c15c> in <module>()
1 blah=train.ix[:,a]
----> 2 clf.fit(Y, blah)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
815 gcv_mode=self.gcv_mode,
816 store_cv_values=self.store_cv_values)
--> 817 estimator.fit(X, y, sample_weight=sample_weight)
818 self.alpha_ = estimator.alpha_
819 if self.store_cv_values:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
722 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
723
--> 724 v, Q, QT_y = _pre_compute(X, y)
725 n_y = 1 if len(y.shape) == 1 else y.shape[1]
726 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute(self, X, y)
607 def _pre_compute(self, X, y):
608 # even if X is very sparse, K is usually very dense
--> 609 K = safe_sparse_dot(X, X.T, dense_output=True)
610 v, Q = linalg.eigh(K)
611 QT_y = np.dot(Q.T, y)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
76 from scipy import sparse
77 if sparse.issparse(a) or sparse.issparse(b):
---> 78 ret = a * b
79 if dense_output and hasattr(ret, "toarray"):
80 ret = ret.toarray()
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in __mul__(self, other)
301 if self.shape[1] != other.shape[0]:
302 raise ValueError('dimension mismatch')
--> 303 return self._mul_sparse_matrix(other)
304
305 try:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\compressed.pyc in _mul_sparse_matrix(self, other)
518
519 nnz = indptr[-1]
--> 520 indices = np.empty(nnz, dtype=np.intc)
521 data = np.empty(nnz, dtype=upcast(self.dtype,other.dtype))
522
ValueError: negative dimensions are not allowed
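Answer 0 (score: 2)
It looks like this problem occurs even without sklearn. It is in scipy.sparse matrix multiplication. There is a thread about it on a scipy-user board: "sparse matrix multiplication problem". The crux is that scipy uses 32-bit ints for the non-zero indices during sparse matrix multiplication. That is the marked line at the bottom of the traceback above. If the product has too many non-zero elements, that count can overflow. The overflow makes the variable nnz negative. The code at the last arrow then tries to create an empty array of size nnz, which raises the ValueError because of the negative dimension.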
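To make the overflow concrete, here is a minimal sketch using only NumPy. The helper name `as_int32` and the count of 3000000000 are made up for illustration; the count was picked only because it exceeds 2**31 - 1, and it is not the actual nnz from the question.

```python
import numpy as np

def as_int32(count):
    """Return `count` as it reads back after being stored in a signed 32-bit int."""
    return int(np.array([count], dtype=np.int64).astype(np.int32)[0])

# Hypothetical non-zero count, chosen only because it exceeds 2**31 - 1.
true_nnz = 3000000000
stored_nnz = as_int32(true_nnz)
print(stored_nnz)  # -1294967296, i.e. 3000000000 - 2**32

# Allocating an index array with that wrapped size fails like the traceback does.
message = None
try:
    np.empty(stored_nnz, dtype=np.intc)
except ValueError as exc:
    message = str(exc)
    print(message)
```

The wrapped value is negative, so the allocation fails before any multiplication result is stored.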
You can reproduce the tail end of the traceback above without sklearn like this:
import scipy.sparse as ss
X = ss.rand(75000, 42000, format='csr', density=0.01)
X * X.T
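In this question the input is probably very sparse, but the last sklearn frames of the traceback show RidgeCV multiplying X by X.T. That product may not be sparse enough.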
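One rough way to check this before multiplying is to bound the number of non-zeros the product can have. The sketch below is an assumption-laden estimate, not what scipy or sklearn compute internally: `gram_nnz_upper_bound` is a hypothetical helper, and the bound it returns (the smaller of n_rows squared and the sum of squared per-column counts) can be far above the true nnz.

```python
import numpy as np
import scipy.sparse as ss

INT32_MAX = 2**31 - 1

def gram_nnz_upper_bound(X):
    """Upper bound on the stored non-zeros of X * X.T (hypothetical helper).

    Every column with c stored entries contributes at most c * c entries to
    the product, and the product cannot hold more than n_rows ** 2 entries.
    """
    n_rows = X.shape[0]
    # Stored entries per column; cast to int64 so squaring cannot wrap.
    col_counts = np.diff(X.tocsc().indptr).astype(np.int64)
    pair_bound = int((col_counts ** 2).sum())
    return min(n_rows * n_rows, pair_bound)

# Small random matrix so the sketch runs quickly; the shape is arbitrary.
X = ss.rand(200, 50, format='csr', density=0.05, random_state=0)
bound = gram_nnz_upper_bound(X)
print(bound, bound <= INT32_MAX)
```

If the bound is below INT32_MAX, the product's non-zero count fits in a 32-bit int. If it is above, the multiplication may still succeed, since this is only an upper bound.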