pyspark LDA: getting the topic words

Asked: 2018-11-26 10:15:42

Tags: apache-spark pyspark lda topic-modeling

I am trying to run LDA, but instead of applying it to words and documents I am applying it to error messages and error causes. Each row is an error and each column is an error cause; a cell is 1 if that cause is active in the error and 0 otherwise. I am now trying to get the error cause names (rather than just their indices) for each topic the model creates (a topic here corresponds to an error pattern). The code I have so far, which seems to work, is below.
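For illustration only, a minimal sketch of what the input df could look like (the error-cause column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per error, one 0/1 column per (hypothetical) error cause;
# error_ID identifies the row and is excluded from the features later.
df = spark.createDataFrame(
    [(1, 1, 0, 1),
     (2, 0, 1, 1),
     (3, 1, 1, 0)],
    ["error_ID", "cause_timeout", "cause_disk_full", "cause_bad_config"])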

from pyspark.ml.clustering import LDA
from pyspark.ml.feature import VectorAssembler

# VectorAssembler combines all feature columns into one vector column
assembler = VectorAssembler(
    inputCols=list(set(df.columns) - {'error_ID'}),
    outputCol="features")
lda_input = assembler.transform(df)

# Train LDA model
lda = LDA(k=5, maxIter=10, featuresCol="features")
model = lda.fit(lda_input)

# A model with higher log-likelihood and lower perplexity is considered to be good.
ll = model.logLikelihood(lda_input)
lp = model.logPerplexity(lda_input)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
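# describeTopics returns a DataFrame with the columns
# (topic, termIndices, termWeights); the indices refer to
# positions in the "features" vector built by the VectorAssembler.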
topics = model.describeTopics(7)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Show the per-row topic distributions
transformed = model.transform(lda_input)
transformed.show(truncate=False)  # show() prints itself and returns None

My output is:

(screenshot: the transformed output with a topic distribution for each row)

Based on https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda, I added the following section, which does not work:

topics = model.topicsMatrix()
for topic in range(10):
    print("Topic " + str(topic) + ":")
    for word in range(0, model.vocabSize()):
        print(" " + str(topics[word][topic]))

How can I now get the most common error causes, i.e. find the columns that correspond to the term indices?

1 Answer:

Answer 0 (score: 0)

To iterate over a DenseMatrix you need to convert it to an array first: a pyspark.ml.linalg.DenseMatrix is indexed with a tuple (topics[word, topic]), so the chained topics[word][topic] from the docs snippet fails, whereas a NumPy array accepts both forms. The code below should run without errors, but I am not sure about the printed result, since that depends on your data.

topn_words = 10
num_topics = 5  # must match the k the model was trained with (k=5 above)

# topicsMatrix() is a vocabSize x k DenseMatrix; convert it to a
# NumPy array so it can be indexed with topics[word][topic]
topics = model.topicsMatrix().toArray()
for topic in range(num_topics):
    print("Topic " + str(topic) + ":")
    # note: this walks the first topn_words vocabulary entries,
    # not the top-weighted terms for the topic
    for word in range(topn_words):
        print(" " + str(topics[word][topic]))