Question

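I took a bunch of documents and computed tf*idf for every token across all of them, then built a vector for each document (each n-dimensional, where n is the number of unique words in the corpus). I can't figure out how to use sklearn.cluster.MeanShift to create clusters from these vectors.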

Answer 1

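TfidfVectorizer converts documents into a "sparse matrix" of numbers. MeanShift requires the data passed to it to be "dense". Below, I show how to do the conversion inside a pipeline; memory permitting, though, you can simply convert the sparse matrix to a dense one with toarray() or todense().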

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MeanShift
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

pipeline = Pipeline(
  steps=[
    ('tfidf', TfidfVectorizer()),
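    # toarray() returns a plain ndarray; todense() returns np.matrix, which newer scikit-learn versions reject
    ('trans', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),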
    ('clust', MeanShift())
  ])

pipeline.fit(documents)
pipeline.named_steps['clust'].labels_

result = [(label,doc) for doc,label in zip(documents, pipeline.named_steps['clust'].labels_)]

for label,doc in sorted(result):
  print(label, doc)

Prints:

0 document two is mean
0 this is document one
0 this is document two
1 document one is fun
1 how fun is document one?
2 mean shift... what is that
3 document is really short

You could tune the "hyperparameters", but I think this gives you the general idea.
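For what it's worth, here is a minimal sketch of the non-pipeline route mentioned above, with the main MeanShift hyperparameter (bandwidth) set explicitly instead of left to the default. The quantile value of 0.3 and the variable names are my own assumptions for illustration, not tuned settings, so treat this as a starting point rather than a recommendation.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

# Vectorize, then convert the sparse tf-idf matrix to a dense ndarray.
# This is only reasonable when the dense matrix fits in memory.
X = TfidfVectorizer().fit_transform(documents).toarray()

# Estimate a bandwidth from the data. quantile=0.3 is an illustrative
# assumption; smaller values tend to give more, smaller clusters.
bandwidth = estimate_bandwidth(X, quantile=0.3)

# Cluster with the explicit bandwidth and collect one label per document.
labels = MeanShift(bandwidth=bandwidth).fit_predict(X)

for label, doc in sorted(zip(labels, documents)):
    print(label, doc)
```

Changing quantile (or passing a fixed bandwidth directly) changes how many clusters come out, so it is worth trying a few values on your own corpus.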

Document clustering with Mean Shift
