I have a spreadsheet with comments, and I am trying to extract two pieces of information from it:
The most frequently used nouns and verbs across all Reviews and GenMgrCom, grouped by trip type, including usage counts (also by year and day of week, but I'm sure I can adapt the code for those).
The most frequently used nouns and verbs across all Reviews and GenMgrCom, grouped by the sentiment of the comment. (I don't even know where to start with this one, so even suggested search terms that might turn up results would be appreciated.)
I have already concatenated the Reviews and GenMgrCom columns into a single text column, but I'm stuck on the next step.
I am trying to adapt the following code:
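For reference, that concatenation step can be sketched with pandas. This is a minimal sketch: the toy frame and the "Reviews" column name are assumptions; only "GenMgrCom" and "Trip Type" appear in the post.

```python
import pandas as pd

# Toy frame standing in for the spreadsheet; the "Reviews" column name is
# an assumption, while "GenMgrCom" and "Trip Type" come from the question.
text_reviews = pd.DataFrame({
    "Reviews":   ["Great stay, very clean.", "Room was dirty."],
    "GenMgrCom": ["Thank you!", "We apologise."],
    "Trip Type": ["Leisure", "Business"],
})

# Concatenate the two comment columns into a single text column.
text_reviews["text"] = text_reviews["Reviews"].str.cat(
    text_reviews["GenMgrCom"], sep=" ")
print(text_reviews["text"].tolist())
```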
import string
import nltk

def remove_punctuation(text):
    '''A function for removing punctuation.'''
    # Map each punctuation mark to nothing, which in effect deletes it.
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

ByTripType = text_reviews.groupby("Trip Type")

# word frequency by trip type
wordFreqByTripType = nltk.probability.ConditionalFreqDist()

# for each trip type...
for name, group in ByTripType:
    sentences = group['text'].str.cat(sep=' ')
    # convert everything to lower case (so "The" and "the" are counted as
    # the same word rather than two different words)
    sentences = sentences.lower()
    # split the text into individual tokens
    tokens = nltk.tokenize.word_tokenize(sentences)
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)
    # add the frequencies for this trip type to our dictionary
    wordFreqByTripType[name] = frequency

# now we have a dictionary where each entry is the frequency distribution
# of words for a specific trip type
wordFreqByTripType.values()
Output:
dict_values([FreqDist({'the': 1538, '.': 1526, 'and': 1102, 'to': 828, ',': 812, 'was': 779, 'a': 652, '...': 641, 'i': 544, 'in': 408, ...}), FreqDist({'.': 2465, 'the': 2391, 'and': 1657, 'to': 1400, ',': 1167, 'was': 1161, 'a': 1018, 'we': 844, 'in': 600, 'very': 580, ...}), FreqDist({'.': 1413, 'the': 1383, 'and': 974, 'to': 800, 'was': 735, ',': 604, 'a': 565, 'very': 366, 'we': 352, 'for': 347, ...}), FreqDist({'the': 318, '.': 271, 'and': 226, '?': 199, 'to': 187, 'was': 184, ',': 153, 'a': 136, 'we': 106, 'i': 86, ...}), FreqDist({'.': 823, 'the': 759, 'and': 543, 'was': 493, 'to': 435, 'i': 390, ',': 371, 'a': 322, 'in': 206, 'room': 187, ...})])
But the output is not labelled by trip type, and I'm not sure how to add a filter for nouns and verbs. Every time I try to apply pos_tag I get an error saying it expects a string, because what I'm passing is currently an object. And if I work around that so I can extract just the nouns and verbs, it no longer removes the punctuation.
Answer 0 (score: 0)
NLTK's pos_tag method requires an iterable of strings, so you need to POS-tag the tokens, filter out the words that are not nouns or verbs, and then pass that list to your frequency distribution. So, something like this:
tokens = nltk.tokenize.word_tokenize(sentences)
tagged_tokens = nltk.pos_tag(tokens)
nouns_and_verbs = [token[0] for token in tagged_tokens if token[1] in ['VBD', 'VBP', 'NN']]
frequency = nltk.FreqDist(nouns_and_verbs)
Then you can return the top n for each group as needed.
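To make that "top n per group" step concrete, here is a stdlib-only sketch using collections.Counter in place of nltk.FreqDist. The tagged tokens and the kept tag set are invented for illustration; in the real pipeline the (word, tag) pairs come from nltk.pos_tag.

```python
from collections import Counter

# Hypothetical (word, POS-tag) pairs per trip type; in the real pipeline
# these come from nltk.pos_tag on each group's tokens.
tagged_by_trip = {
    "Business": [("room", "NN"), ("was", "VBD"), ("room", "NN"), ("stay", "NN")],
    "Leisure":  [("pool", "NN"), ("loved", "VBD"), ("pool", "NN")],
}

KEEP = {"NN", "NNS", "VBD", "VBP"}  # noun/verb tags to keep

def top_n(tagged_tokens, n=2):
    """Count only the kept nouns/verbs and return the n most frequent."""
    counts = Counter(word for word, tag in tagged_tokens if tag in KEEP)
    return counts.most_common(n)

for trip_type, tagged in tagged_by_trip.items():
    print(trip_type, top_n(tagged))
```

Counter.most_common plays the same role here as FreqDist.most_common in the NLTK version.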
Answer 1 (score: 0)
Thanks, this is what I ended up with to get there. Thank you for your help.
ByTripType = text_reviews.groupby("Trip Type")

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(10)) for tag in cfd.conditions())

for name, group in ByTripType:
    sentences = group['text'].str.cat(sep=' ')
    sentences = sentences.lower()
    # assign the result back: remove_punctuation returns a new string
    # rather than modifying its argument
    sentences = remove_punctuation(sentences)
    text = nltk.tokenize.word_tokenize(sentences)
    tagged = nltk.pos_tag(text)
    for i in ('NN', 'VBP'):
        tagdict = findtags(i, tagged)
        print(name, tagdict)
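For the second question (grouping by sentiment), a common starting point is a polarity lexicon; useful search terms are "sentiment analysis", "VADER", and NLTK's SentimentIntensityAnalyzer. Below is a stdlib-only sketch of the grouping idea, with a toy lexicon standing in for a real sentiment model; the word lists and the label function are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy word lists standing in for a real sentiment model such as NLTK's
# VADER; invented for illustration only.
POSITIVE = {"great", "clean", "friendly"}
NEGATIVE = {"dirty", "rude", "noisy"}

def label(tokens):
    """Crude polarity label based on lexicon hits."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def word_freq_by_sentiment(reviews):
    """Group tokenised reviews by sentiment label, then count words."""
    freq = defaultdict(Counter)
    for tokens in reviews:
        freq[label(tokens)].update(tokens)
    return freq

reviews = [["great", "clean", "room"], ["dirty", "noisy", "room"]]
print(dict(word_freq_by_sentiment(reviews)))
```

The same filtering used above (keeping only noun/verb POS tags) can be applied to each per-sentiment Counter before taking the most common words.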