Question

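I am trying to print the topics, and the texts for each topic, from an LDA model. But the None that appears after the topics are printed breaks my script. I can print the topics, but not the texts.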

import pandas
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_top_words = 5
n_components = 5

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

        return message

text = pandas.read_csv('text.csv', encoding = 'utf-8')
text_list = text.values.tolist()

tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_list)

lda = LatentDirichletAllocation(n_components=n_components, learning_method='batch', max_iter=25, random_state=0)

doc_distr = lda.fit_transform(tf)

tf_feature_names = tf_vectorizer.get_feature_names()
print (print_top_words(lda, tf_feature_names, n_top_words))

doc_distr = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)
for i in range(len(topics)):
    print ("Topic {}:".format(i))
    docs = np.argsort(doc_distr[:, i])[::-1]
    for j in docs[:10]:
       print (" ".join(text_list[j].split(",")[:2]))

My output:

Topic 0: no order mail received back 

Topic 1: cancel order wishes possible wish 

Topic 2: keep current informed delivery order 

Topic 3: faulty wooden box present side 

Topic 4: delivered received be produced urgent 

Topic 5: good waiting day response share

Followed by this error:

  File "lda.py", line 41, in <module>

    for i in range(len(topics)):

TypeError: object of type 'NoneType' has no len()

Answer 1

The print_top_words() function has (at least) four problems.

The first, which causes your current problem, is that if model.components_ is empty, the body of the for loop is never executed and your function then (implicitly) returns None.

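The second is more subtle: if model.components_ is not empty, the function returns only the first message and then exits. That is the definition of the return statement: return a value (None if no value is specified) and exit the function.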

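The third problem is that, when model.components_ is not empty, the function returns a string where the calling code clearly expects a list. This is a subtle bug: strings have a length, so the for loop over range(len(topics)) will appear to work, but len(topics) is certainly not the value you expect.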
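The first three problems can be reproduced without scikit-learn. The sketch below is only an illustration under stated assumptions: SimpleNamespace objects stand in for a fitted LDA model, the vocabulary and weights are made up, and plain Python sorting replaces NumPy's argsort. The control flow of print_top_words is the same as in the question.

```python
from types import SimpleNamespace

def print_top_words(model, feature_names, n_top_words):
    # Same control flow as the function in the question; sorted() stands in
    # for topic.argsort() so the example needs no third-party packages.
    for topic_idx, topic in enumerate(model.components_):
        top = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)[:n_top_words]
        message = "Topic #%d: " % topic_idx
        message += " ".join(feature_names[i] for i in top)
        return message  # exits the function during the first iteration

# Hypothetical stand-ins for a vocabulary and two fitted models.
feature_names = ["order", "mail", "cancel", "box"]
empty_model = SimpleNamespace(components_=[])
two_topic_model = SimpleNamespace(components_=[[3, 2, 0, 0], [0, 0, 5, 4]])

no_topics = print_top_words(empty_model, feature_names, 2)
first_only = print_top_words(two_topic_model, feature_names, 2)

print(no_topics)        # None: the loop body never ran
print(first_only)       # only the message for topic 0
print(len(first_only))  # a character count, not a topic count
```

With the empty model, calling len() on the result raises the TypeError from the question. With the two-topic model, len() succeeds but counts characters, so a loop over range(len(topics)) would run far more often than there are topics.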

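Finally, the function is badly named, because it does not "print" anything. Compared with the first three problems this may seem trivial, and it will not stop the code from working (once the first three are fixed), but reasoning about code is hard enough in itself, so accurate naming matters: it greatly reduces cognitive load and makes maintenance and debugging easier.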

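Long story short: think about what you really want this function to do and fix it accordingly. Since I am not sure what you are trying to do, I will not post a "corrected" version here, but the explanations above should help.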

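NB: you also call lda.fit_transform(tf) and print_top_words(lda, tf_feature_names, n_top_words) twice each with exactly the same arguments. That is either completely useless, a pure waste of processor cycles (in the best case), or a sign of yet another bug if the second calls give you different results.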

Answer 2

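You did not provide the complete code, but the most likely cause is that the variable topics is None. The only way that can happen is if model.components_ in the print_top_words function is an empty collection: the loop then never runs and the function (implicitly) returns None. Check the value of that collection. Better still, decide what you want the function to return in that case.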

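Another, unrelated point: you initialize the message variable on every iteration and also return it on every iteration. Check that this is what you mean.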
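One possible way to act on both points is sketched below; it is not necessarily what the asker intended. The result list is created once before the loop, the return comes after the loop, and an empty list is returned when there are no topics. The name top_words_per_topic, the SimpleNamespace stand-in for a fitted model, and the sample data are all assumptions made for illustration.

```python
from types import SimpleNamespace

def top_words_per_topic(model, feature_names, n_top_words):
    """Return one message per topic, or an empty list if there are no topics."""
    messages = []  # created once, before the loop
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights in this topic.
        top = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)[:n_top_words]
        words = " ".join(feature_names[i] for i in top)
        messages.append("Topic #%d: %s" % (topic_idx, words))
    return messages  # after the loop, so every topic is included

# Hypothetical stand-ins for a vocabulary and fitted models.
feature_names = ["order", "mail", "cancel", "box"]
model = SimpleNamespace(components_=[[3, 2, 0, 0], [0, 0, 5, 4]])

topics = top_words_per_topic(model, feature_names, 2)
no_topics = top_words_per_topic(SimpleNamespace(components_=[]), feature_names, 2)

for line in topics:
    print(line)
```

Because the function always returns a list, len(topics) is the number of topics and is 0 rather than an error when the collection is empty.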

Answer 3

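This is hard to answer without knowing the inner workings of LatentDirichletAllocation. However, it has something to do with components_, since iterating over it repeatedly yields different results.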

You can most likely avoid this error by changing:

print (print_top_words(lda, tf_feature_names, n_top_words))

doc_distr = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)

To:

temp = print_top_words(lda, tf_feature_names, n_top_words)
print (temp)

doc_distr = lda.fit_transform(tf)
topics = temp

The second time the function is called, model.components_ yields nothing, so the loop is skipped and the function returns nothing.

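However, I am not sure this is what the code is actually meant to do. It looks like you may want print_top_words to be a generator? You return inside the for loop, so it never reaches a second iteration. That is probably not the purpose of the loop.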
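If a generator is what was intended, a minimal sketch could look like the following. It assumes a hypothetical name iter_top_words, a SimpleNamespace stand-in for the fitted model, and made-up data; the only change that matters is yield in place of return.

```python
from types import SimpleNamespace

def iter_top_words(model, feature_names, n_top_words):
    """Yield one message per topic instead of returning after the first."""
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights in this topic.
        top = sorted(range(len(topic)), key=lambda i: topic[i], reverse=True)[:n_top_words]
        words = " ".join(feature_names[i] for i in top)
        yield "Topic #%d: %s" % (topic_idx, words)  # the loop resumes after each yield

# Hypothetical stand-ins for a vocabulary and a fitted model.
feature_names = ["order", "mail", "cancel", "box"]
model = SimpleNamespace(components_=[[3, 2, 0, 0], [0, 0, 5, 4]])

# A generator has no len(), so materialize it before using range(len(topics)).
topics = list(iter_top_words(model, feature_names, 2))
for line in topics:
    print(line)
```

Calling list() on the generator gives an empty list when there are no topics, so the later len(topics) call no longer fails.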

Returning None from a function: TypeError: object of type 'NoneType' has no len()
