Spectral clustering on a sparse dataset

Time: 2016-01-19 09:31:19

Tags: python scipy scikit-learn cluster-analysis spectral

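I am applying spectral clustering (sklearn.cluster.SpectralClustering) to a dataset in which some of the features are relatively sparse. When I run the clustering in Python, I get the following warning: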

UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn("Graph is not fully connected, spectral embedding"

This is usually followed by an error like the following:

`
File "****.py", line 120, in perform_clustering_spectral_clustering
  predicted_clusters = cluster.SpectralClustering(n_clusters=n).fit_predict(features)
File "****\sklearn\base.py", line 349, in fit_predict
  self.fit(X)
File "****\sklearn\cluster\spectral.py", line 450, in fit
  assign_labels=self.assign_labels)
File "****\sklearn\cluster\spectral.py", line 256, in spectral_clustering
  eigen_tol=eigen_tol, drop_first=False)
File "****\sklearn\manifold\spectral_embedding_.py", line 297, in spectral_embedding
  largest=False, maxiter=2000)
File "****\scipy\sparse\linalg\eigen\lobpcg\lobpcg.py", line 462, in lobpcg
  activeBlockVectorBP, retInvR=True)
File "****\scipy\sparse\linalg\eigen\lobpcg\lobpcg.py", line 112, in _b_orthonormalize
  gramVBV = cholesky(gramVBV)
File "****\scipy\linalg\decomp_cholesky.py", line 81, in cholesky
  check_finite=check_finite)
File "****\scipy\linalg\decomp_cholesky.py", line 30, in _cholesky
  raise LinAlgError("%d-th leading minor not positive definite" % info)
numpy.linalg.linalg.LinAlgError: 9-th leading minor not positive definite
numpy.linalg.linalg.LinAlgError: 9-th leading minor not positive definite
numpy.linalg.linalg.LinAlgError: the leading minor of order 12 of 'b' is not positive definite. The factorization of 'b' could not be completed and no eigenvalues or eigenvectors were computed.`

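However, the warning/error does not always appear when I use the same settings (that is, the behavior is inconsistent, which makes it hard to test). It occurs for different values of n_clusters, but more often for n = 2 and n > 7 (at least in my brief experience; as I mentioned, the behavior is not very consistent).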

How should I deal with this warning and the related error? Does it depend on the number of features? What happens if I add more?

1 Answer:

Answer 0: (score: 1)

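I have run into this problem with n_clusters as well. Since this is unsupervised ML, there is no single correct value for n_clusters. In your case, it seems that n_clusters lies between 3 and 7. Assuming you have some ground truth to compare the clusters against, the best approach is to try several n_clusters values and see whether any pattern emerges for the given dataset, while making sure to avoid over-fitting. You can also use the silhouette coefficient (sklearn.metrics.silhouette_score).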
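To make the suggestion concrete, here is a rough sketch of such a sweep. It is not your actual pipeline: the `features` array is synthetic data from `make_blobs`, used only as a stand-in, and the range 2 to 7 and the fixed `random_state` are my assumptions for illustration.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real feature matrix (an assumption, not the asker's data).
features, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for n in range(2, 8):
    # Default RBF affinity; random_state fixed so repeated runs are comparable.
    model = SpectralClustering(n_clusters=n, random_state=0)
    labels = model.fit_predict(features)
    # silhouette_score needs at least two distinct labels to be defined.
    if len(set(labels)) < 2:
        continue
    scores[n] = silhouette_score(features, labels)

# Report the score for each candidate and pick the highest one.
for n, score in sorted(scores.items()):
    print(f"n_clusters={n}: silhouette={score:.3f}")
best_n = max(scores, key=scores.get)
print("best n_clusters by silhouette:", best_n)
```

A higher silhouette score means tighter, better separated clusters, so the value of n with the highest score is a reasonable candidate. Treat it as a hint rather than a definitive answer, and check it against whatever ground truth you have.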