How to cluster documents under topics using latent semantic analysis (LSA)

Asked: 2016-06-14 17:39:48

Tags: python cluster-analysis tf-idf lsa

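I have been working on latent semantic analysis (LSA) and applied this example: https://radimrehurek.com/gensim/tut2.html

It covers clustering terms under topics, but I could not find how to cluster documents under topics.

That example says: 'According to LSI, "trees", "graph" and "minors" are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are related to the first topic.'

How can we associate those five documents with their related topic in Python code?

You can find my Python code below. I would appreciate any help.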

from numpy import asarray
from gensim import corpora, models, similarities

#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)

texts = [[word for word in text if word not in tokens_once] for text in texts]

dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]

# extract 2 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=2)

corpus_lsi = lsi[corpus_tfidf]


# print the terms that contribute most to each topic
for topic in lsi.print_topics(num_topics=lsi.num_topics):
    print(topic)

for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

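1 Answer:

Answer 0: (score: 1)

corpus_lsi contains 9 vectors, one per document. Each vector stores, at index i, a score for how strongly that document relates to topic i (in LSI these are projection weights rather than probabilities, and they can be negative). If you only want to assign each document to a single topic, pick the topic index with the highest value in its vector.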
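The assignment described above can be sketched as follows. This is a minimal sketch, not gensim's own API: the helper names `dominant_topic` and `group_by_topic` and the `use_abs` flag are hypothetical, and the sample vectors are made up for illustration. It assumes each document vector is a list of `(topic_id, weight)` pairs, which is the format gensim yields when iterating over `lsi[corpus]`. Because the sign of an LSI dimension is arbitrary, the sketch compares absolute values by default; that is an assumption on my part, and `use_abs=False` gives the literal "highest value" rule.

```python
from collections import defaultdict


def dominant_topic(doc_vector, use_abs=True):
    """Return the topic id with the largest weight, or None if the vector is empty.

    doc_vector is a list of (topic_id, weight) pairs.
    use_abs is a hypothetical switch: compare |weight| instead of weight.
    """
    if not doc_vector:
        return None
    if use_abs:
        return max(doc_vector, key=lambda pair: abs(pair[1]))[0]
    return max(doc_vector, key=lambda pair: pair[1])[0]


def group_by_topic(doc_vectors, use_abs=True):
    """Map each topic id to the list of 0-based document indices assigned to it."""
    clusters = defaultdict(list)
    for doc_index, vector in enumerate(doc_vectors):
        topic = dominant_topic(vector, use_abs=use_abs)
        if topic is not None:
            clusters[topic].append(doc_index)
    return dict(clusters)


# Illustrative vectors only (made-up values, not actual gensim output).
example_vectors = [
    [(0, 0.07), (1, 0.52)],
    [(0, 0.20), (1, 0.76)],
    [(0, 0.70), (1, -0.16)],
    [(0, -0.88), (1, 0.17)],
]
clusters = group_by_topic(example_vectors)
print(clusters)
```

With the code from the question, calling `group_by_topic(corpus_lsi)` should give the document indices per topic, where index 0 is the first entry of `documents`. Whether absolute values or raw values are the better criterion depends on how you interpret negative weights, so it is worth checking the result against the printed vectors.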