Question

Gensim has a tutorial showing how, given a document/query string, to list the other documents most similar to it in descending order:

http://radimrehurek.com/gensim/tut3.html

It can also display the topics associated with an entire model:

How to print the LDA topics models from gensim? Python

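But how do you find which topics are associated with a given document/query string, ideally with some numeric similarity measure for each topic? I haven't been able to find anything.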

Answer 1

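If you want to find the topic distribution of an unseen document, you first need to convert the document of interest into a bag-of-words representation: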

from gensim import utils, models
from gensim.corpora import Dictionary
lda = models.LdaModel.load('saved_lda.model') # load saved model
dictionary = Dictionary.load('saved_dictionary.dict') # load saved dict
with open('document', 'r') as inp: # read the whole file into one string
    text = inp.read()
tkn_doc = utils.simple_preprocess(text) # filter & tokenize words
doc_bow = dictionary.doc2bow(tkn_doc) # use dictionary to create bow
doc_vec = lda[doc_bow] # this is the topic probability distribution for the document of interest

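This code gives you a sparse vector: a list of (topic_id, weight) pairs in which the ids refer to topics 0 ... n and each 'weight' is the probability that the words in the document belong to that topic in the model. You can visualize the distribution by creating a bar chart with matplotlib.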
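To get one numeric score per topic, as the question asks, the sparse result can be expanded into a dense list and sorted. The following is only a minimal sketch in plain Python: `to_dense` and `rank_topics` are hypothetical helper names, and the sample `doc_vec` values are made up for illustration rather than produced by a real model.

```python
def to_dense(doc_vec, num_topics):
    """Expand sparse (topic_id, probability) pairs into one entry per topic."""
    dense = [0.0] * num_topics
    for topic_id, prob in doc_vec:
        dense[topic_id] = prob
    return dense


def rank_topics(doc_vec):
    """Return the (topic_id, probability) pairs, most probable topic first."""
    return sorted(doc_vec, key=lambda pair: pair[1], reverse=True)


# Made-up example in the same shape as the output of lda[doc_bow]
doc_vec = [(0, 0.12), (3, 0.71), (4, 0.17)]

dense = to_dense(doc_vec, num_topics=5)  # topics absent from doc_vec become 0.0
ranked = rank_topics(doc_vec)            # strongest topic first

print(dense)
print(ranked)
```

Topics that are missing from the sparse result fell below the model's minimum probability threshold, so this sketch treats them as zero. With a real model, `num_topics` would presumably come from `lda.num_topics`.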

import numpy as np
import matplotlib.pyplot as plt

output_path = 'topic_distribution.png' # placeholder file name, change as needed
y_axis = []
x_axis = []
for topic_id, prob in doc_vec: # doc_vec holds (topic_id, probability) pairs
    x_axis.append(topic_id + 1) # label topics starting from 1
    y_axis.append(prob)
width = 1
plt.bar(x_axis, y_axis, width, align='center', color='r')
plt.xlabel('Topics')
plt.ylabel('Probability')
plt.title('Topic Distribution for doc')
plt.xticks(np.arange(2, max(x_axis) + 1, 2), rotation='vertical', fontsize=7)
plt.subplots_adjust(bottom=0.2)
plt.ylim([0, np.max(y_axis) + .01])
plt.xlim([0, max(x_axis) + 1])
plt.savefig(output_path)
plt.close()

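If you want to see the top-n terms in each topic, you can print them with the model's `print_topics` or `show_topic` methods. Referring back to the bar chart, you can look up the printed top-n words and work out how the model interpreted the document. You can also measure the distance between the topic probability vectors of two different documents with a vector distance such as Hellinger, Euclidean, or Jensen-Shannon.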
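As a rough illustration of comparing two documents by their topic distributions, here is a sketch of the Hellinger distance and the Jensen-Shannon divergence in plain Python. It assumes both vectors are dense, have the same length, and sum to 1; the function names and the sample vectors are made up for this example and are not part of Gensim.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two dense probability vectors (0 to 1)."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 are skipped."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up dense topic distributions for two documents
doc_a = [0.7, 0.2, 0.1, 0.0]
doc_b = [0.1, 0.1, 0.1, 0.7]

print(hellinger(doc_a, doc_b))
print(jensen_shannon(doc_a, doc_b))
```

Both measures return 0 for identical distributions and 1 for distributions that share no topics, which makes them easier to compare across documents than a raw Euclidean distance.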

Showing the topics associated with a document/query in Gensim
