I have a spreadsheet with comments, and I am trying to extract two pieces of information from it:
The most frequently used nouns and verbs across all Reviews and GenMgrCom, grouped by trip type, including usage counts (also by year and day of week, but I'm sure I can adapt the code for those).
The most frequently used nouns and verbs across all Reviews and GenMgrCom, grouped by the sentiment of the comment. (I don't even know where to start with this one, so even suggested search terms that might turn up results would be appreciated.)
I have already concatenated the Reviews and GenMgrCom columns into a single text column, but I'm stuck on the next step.
I am trying to adapt the following code:
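For reference, that concatenation step can be sketched with pandas. This is a minimal sketch: the toy frame and the "Reviews" column name are assumptions; only "GenMgrCom" and "Trip Type" appear in the post.

```python
import pandas as pd

# Toy frame standing in for the spreadsheet; the "Reviews" column name is
# an assumption, while "GenMgrCom" and "Trip Type" come from the question.
text_reviews = pd.DataFrame({
    "Reviews":   ["Great stay, very clean.", "Room was dirty."],
    "GenMgrCom": ["Thank you!", "We apologise."],
    "Trip Type": ["Leisure", "Business"],
})

# Concatenate the two comment columns into a single text column.
text_reviews["text"] = text_reviews["Reviews"].str.cat(
    text_reviews["GenMgrCom"], sep=" ")
print(text_reviews["text"].tolist())
```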
import string
import nltk

def remove_punctuation(text):
    '''A function for removing punctuation.'''
    # Map each punctuation mark to nothing, which in effect deletes it.
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

ByTripType = text_reviews.groupby("Trip Type")

# word frequency by trip type
wordFreqByTripType = nltk.probability.ConditionalFreqDist()

# for each trip type...
for name, group in ByTripType:
    sentences = group['text'].str.cat(sep=' ')
    # convert everything to lower case (so "The" and "the" are counted as
    # the same word rather than two different words)
    sentences = sentences.lower()
    # split the text into individual tokens
    tokens = nltk.tokenize.word_tokenize(sentences)
    # calculate the frequency of each token
    frequency = nltk.FreqDist(tokens)
    # add the frequencies for this trip type to our dictionary
    wordFreqByTripType[name] = frequency

# now we have a dictionary where each entry is the frequency distribution
# of words for a specific trip type
wordFreqByTripType.values()
Output:
dict_values([FreqDist({'the': 1538, '.': 1526, 'and': 1102, 'to': 828, ',': 812, 'was': 779, 'a': 652, '...': 641, 'i': 544, 'in': 408, ...}), FreqDist({'.': 2465, 'the': 2391, 'and': 1657, 'to': 1400, ',': 1167, 'was': 1161, 'a': 1018, 'we': 844, 'in': 600, 'very': 580, ...}), FreqDist({'.': 1413, 'the': 1383, 'and': 974, 'to': 800, 'was': 735, ',': 604, 'a': 565, 'very': 366, 'we': 352, 'for': 347, ...}), FreqDist({'the': 318, '.': 271, 'and': 226, '?': 199, 'to': 187, 'was': 184, ',': 153, 'a': 136, 'we': 106, 'i': 86, ...}), FreqDist({'.': 823, 'the': 759, 'and': 543, 'was': 493, 'to': 435, 'i': 390, ',': 371, 'a': 322, 'in': 206, 'room': 187, ...})])
But the output is not labelled by trip type, and I'm not sure how to add a filter for nouns and verbs. Every time I try to apply pos_tag I get an error saying it expects a string, because what I'm passing is currently an object. And if I work around that so I can extract just the nouns and verbs, it no longer removes the punctuation.
Answer 0 (score: 0)
NLTK's pos_tag method requires an iterable of strings, so you need to POS-tag the tokens, filter out the words that are not nouns or verbs, and then pass that list to your frequency distribution. So, something like this:
tokens = nltk.tokenize.word_tokenize(sentences)
tagged_tokens = nltk.pos_tag(tokens)
nouns_and_verbs = [token[0] for token in tagged_tokens if token[1] in ['VBD', 'VBP', 'NN']]
frequency = nltk.FreqDist(nouns_and_verbs)
Then you can return the top n for each group as needed.
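To make that "top n per group" step concrete, here is a stdlib-only sketch using collections.Counter in place of nltk.FreqDist. The tagged tokens and the kept tag set are invented for illustration; in the real pipeline the (word, tag) pairs come from nltk.pos_tag.

```python
from collections import Counter

# Hypothetical (word, POS-tag) pairs per trip type; in the real pipeline
# these come from nltk.pos_tag on each group's tokens.
tagged_by_trip = {
    "Business": [("room", "NN"), ("was", "VBD"), ("room", "NN"), ("stay", "NN")],
    "Leisure":  [("pool", "NN"), ("loved", "VBD"), ("pool", "NN")],
}

KEEP = {"NN", "NNS", "VBD", "VBP"}  # noun/verb tags to keep

def top_n(tagged_tokens, n=2):
    """Count only the kept nouns/verbs and return the n most frequent."""
    counts = Counter(word for word, tag in tagged_tokens if tag in KEEP)
    return counts.most_common(n)

for trip_type, tagged in tagged_by_trip.items():
    print(trip_type, top_n(tagged))
```

Counter.most_common plays the same role here as FreqDist.most_common in the NLTK version.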
Answer 1 (score: 0)
Thanks, this is what I ended up with to get there. Thank you for your help.
ByTripType = text_reviews.groupby("Trip Type")

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(10)) for tag in cfd.conditions())

for name, group in ByTripType:
    sentences = group['text'].str.cat(sep=' ')
    sentences = sentences.lower()
    # assign the result back: remove_punctuation returns a new string
    # rather than modifying its argument
    sentences = remove_punctuation(sentences)
    text = nltk.tokenize.word_tokenize(sentences)
    tagged = nltk.pos_tag(text)
    for i in ('NN', 'VBP'):
        tagdict = findtags(i, tagged)
        print(name, tagdict)
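For the second question (grouping by sentiment), a common starting point is a polarity lexicon; useful search terms are "sentiment analysis", "VADER", and NLTK's SentimentIntensityAnalyzer. Below is a stdlib-only sketch of the grouping idea, with a toy lexicon standing in for a real sentiment model; the word lists and the label function are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy word lists standing in for a real sentiment model such as NLTK's
# VADER; invented for illustration only.
POSITIVE = {"great", "clean", "friendly"}
NEGATIVE = {"dirty", "rude", "noisy"}

def label(tokens):
    """Crude polarity label based on lexicon hits."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def word_freq_by_sentiment(reviews):
    """Group tokenised reviews by sentiment label, then count words."""
    freq = defaultdict(Counter)
    for tokens in reviews:
        freq[label(tokens)].update(tokens)
    return freq

reviews = [["great", "clean", "room"], ["dirty", "noisy", "room"]]
print(dict(word_freq_by_sentiment(reviews)))
```

The same filtering used above (keeping only noun/verb POS tags) can be applied to each per-sentiment Counter before taking the most common words.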