Using numpy and NLTK

Time: 2018-08-01 12:30:07

Tags: numpy nlp jupyter-notebook nltk

I have a spreadsheet with reviews, and I'm trying to extract two pieces of information from it:

  1. The most frequently used nouns and verbs across all Reviews and GenMgrCom, grouped by trip type, including usage counts (also by year and DOW, but I'm sure I can adapt the code for those).

  2. The most frequently used nouns and verbs across all Reviews and GenMgrCom, grouped by the sentiment of those reviews. (I don't even know where to start on this one, so even suggesting search terms that might produce results would be appreciated.)

I have already concatenated the Reviews and GenMgrCom columns into a text column, but I'm having trouble with the next step.


I'm trying to adapt the following code:

def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space, 
    # which in effect deletes the punctuation marks 
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
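As a quick sanity check of the helper: `str.maketrans('', '', string.punctuation)` builds a translation table that maps every ASCII punctuation character to `None`, so `translate` simply drops them. A self-contained copy with a made-up example string:

```python
import string

def remove_punctuation(text):
    '''remove all ASCII punctuation characters from text'''
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

print(remove_punctuation("Great room, friendly staff!"))  # -> Great room friendly staff
```

Note this only covers ASCII punctuation; curly quotes and other Unicode punctuation pass through untouched.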

ByTripType = text_reviews.groupby("Trip Type")

# word frequency by trip type
wordFreqByTripType = nltk.probability.ConditionalFreqDist()

# for each trip type...
for name, group in ByTripType:
    sentences = group['text'].str.cat(sep = ' ')

    # convert everything to lower case (so "The" and "the" get counted as 
    # the same word rather than two different words)
    sentences = sentences.lower()

    # split the text into individual tokens    
    tokens = nltk.tokenize.word_tokenize(sentences)

    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)

    # add the frequencies for this trip type to our conditional distribution
    wordFreqByTripType[name] = frequency

# now we have a dict-like object where each entry is the frequency
# distribution of words for a specific trip type
wordFreqByTripType.values()

Output:

dict_values([FreqDist({'the': 1538, '.': 1526, 'and': 1102, 'to': 828, ',': 812, 'was': 779, 'a': 652, '...': 641, 'i': 544, 'in': 408, ...}), FreqDist({'.': 2465, 'the': 2391, 'and': 1657, 'to': 1400, ',': 1167, 'was': 1161, 'a': 1018, 'we': 844, 'in': 600, 'very': 580, ...}), FreqDist({'.': 1413, 'the': 1383, 'and': 974, 'to': 800, 'was': 735, ',': 604, 'a': 565, 'very': 366, 'we': 352, 'for': 347, ...}), FreqDist({'the': 318, '.': 271, 'and': 226, '?': 199, 'to': 187, 'is': 184, ',': 153, 'a': 136, 'we': 106, 'i': 86, ...}), FreqDist({'.': 823, 'the': 759, 'and': 543, 'was': 493, 'to': 435, 'i': 390, ',': 371, 'a': 322, 'in': 206, 'room': 187, ...})])

But this doesn't break things down by trip type the way I want, and I'm not sure how to add a filter for nouns and verbs. Every time I try to apply pos_tag I get an error saying it expects a string, because the column is currently an object. And even if I could work around that and extract only the nouns and verbs, it still doesn't remove the punctuation.

2 Answers:

Answer 0 (score: 0):

NLTK's pos_tag function expects an iterable of strings, so you need to POS-tag the tokens, filter out the words that aren't nouns or verbs, and then pass the resulting list to your frequency distribution. So, something like this:

tokens = nltk.tokenize.word_tokenize(sentences)
# tag each token with its part of speech
tagged_tokens = nltk.pos_tag(tokens)
# keep only these noun/verb tags (extend the list for others, e.g. 'VB', 'NNS')
nouns_and_verbs = [token[0] for token in tagged_tokens if token[1] in ['VBD', 'VBP', 'NN']]
frequency = nltk.FreqDist(nouns_and_verbs)

Then you can return the top n for each group as needed.
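`FreqDist` is a subclass of `collections.Counter`, so "the top n" is just `most_common(n)`. A minimal NLTK-free sketch of that last step, where the per-group token lists are invented placeholders for the real filtered nouns and verbs:

```python
from collections import Counter

# hypothetical per-trip-type noun/verb tokens standing in for the real data
groups = {
    'Business': ['room', 'room', 'staff', 'room'],
    'Leisure':  ['pool', 'pool', 'kids'],
}

for name, tokens in groups.items():
    frequency = Counter(tokens)  # nltk.FreqDist behaves the same way here
    print(name, frequency.most_common(1))
```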

Answer 1 (score: 0):

Thanks, this is what I ended up with to get the result I wanted. Thank you for the help.

ByTripType = text_reviews.groupby("Trip Type")

def findtags(tag_prefix, tagged_text):
    '''return the 10 most common words for each tag starting with tag_prefix'''
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(10)) for tag in cfd.conditions())

for name, group in ByTripType:
    sentences = group['text'].str.cat(sep = ' ')
    sentences = sentences.lower()
    # strings are immutable, so the stripped result must be reassigned
    sentences = remove_punctuation(sentences)
    text = nltk.tokenize.word_tokenize(sentences)
    tagged = nltk.pos_tag(text)
    for i in ('NN', 'VBP'):
        tagdict = findtags(i, tagged)
        print(name, tagdict)
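For reference, `findtags` buckets `(word, tag)` pairs by their exact tag, keeping only tags with the given prefix, and returns the ten most common words per tag. An equivalent NLTK-free sketch using `Counter` in place of `ConditionalFreqDist` (the tagged tokens below are invented pos_tag-style output for illustration):

```python
from collections import Counter, defaultdict

def findtags_sketch(tag_prefix, tagged_text):
    # bucket words by their exact tag, keeping only tags with the given prefix
    cfd = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            cfd[tag][word] += 1
    return {tag: counter.most_common(10) for tag, counter in cfd.items()}

# hypothetical pos_tag output for one trip type
tagged = [('room', 'NN'), ('room', 'NN'), ('staff', 'NN'),
          ('rooms', 'NNS'), ('love', 'VBP')]
print(findtags_sketch('NN', tagged))
# -> {'NN': [('room', 2), ('staff', 1)], 'NNS': [('rooms', 1)]}
```

Note that the prefix match is why `'NN'` also pulls in `'NNS'`, `'NNP'`, etc., which is usually what you want when collecting "all nouns".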