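I am trying to solve a HackerRank problem that asks for the conditional frequency distribution of all words (lowercased, with stopwords removed) for the given categories 'cfdconditions' and events 'cfdevents'. It also asks for the conditional frequency distribution of the categories 'cfdconditions' against events ending in 'ing' or 'ed'. The frequencies of both distributions should then be tabulated.
My code is: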
def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    import nltk  # the calls below use nltk.ConditionalFreqDist
    from nltk.corpus import brown
    from nltk import ConditionalFreqDist
    from nltk.corpus import stopwords
    stopword = set(stopwords.words('english'))
    # (genre, lowercased word) pairs with stopwords removed
    cdev_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword]
    cdev_cfd = [list(x) for x in cdev_cfd]
    cdev_cfd = nltk.ConditionalFreqDist(cdev_cfd)
    # as posted: the keyword here is `condition`; NLTK documents it as `conditions`
    a = cdev_cfd.tabulate(condition = cfdconditions, samples = cfdevents)
    # (genre, lowercased word) pairs for words ending in 'ing' or 'ed'
    inged_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
    inged_cfd = [list(x) for x in inged_cfd]
    # replace each non-stopword with its suffix so it is counted as 'ing' or 'ed'
    for wd in inged_cfd:
        if wd[1].endswith('ing') and wd[1] not in stopword:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stopword:
            wd[1] = 'ed'
    inged_cfd = nltk.ConditionalFreqDist(inged_cfd)
    b = inged_cfd.tabulate(cfdconditions, samples = ['ed','ing'])
    # tabulate() prints the table and returns None, so a and b are both None
    return(a,b)
However, two test cases still fail. My output is:
                 many years
adventure          24    32
fiction            29    44
science_fiction    11    16
                   ed   ing
adventure        3281  1844
fiction          2943  1767
science_fiction   574   293
and
                 good    bad better
adventure          39      9     30
fiction            60     17     27
mystery            45     13     29
science_fiction    14      1      4
                   ed   ing
adventure        3281  1844
fiction          2943  1767
mystery          2382  1374
science_fiction   574   293
I would be grateful if anyone could help me solve this.
Answer 0 (score: 1)
Try this code and see whether it works.
import nltk
from nltk.corpus import brown, stopwords

def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    stopword = set(stopwords.words('english'))
    # build the distribution over every Brown category, then select rows in tabulate()
    cdev_cfd = nltk.ConditionalFreqDist([(genre, word.lower()) for genre in brown.categories() for word in brown.words(categories=genre) if word.lower() not in stopword])
    cdev_cfd.tabulate(conditions = cfdconditions, samples = cfdevents)
    inged_cfd = [ (genre, word.lower()) for genre in brown.categories() for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
    inged_cfd = [list(x) for x in inged_cfd]
    # replace each non-stopword with its suffix so it is counted as 'ing' or 'ed'
    for wd in inged_cfd:
        if wd[1].endswith('ing') and wd[1] not in stopword:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stopword:
            wd[1] = 'ed'
    #print(inged_cfd)
    inged_cfd = nltk.ConditionalFreqDist(inged_cfd)
    #print(inged_cfd.conditions())
    inged_cfd.tabulate(conditions=cfdconditions, samples = ['ed','ing'])
Answer 1 (score: 0)
Compute cdev_cfd on its own, as shown below, and do not convert it to a list again. The rest of the code looks fine.
cdev_cfd = nltk.ConditionalFreqDist([(genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword])
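To make it clearer what the line above is counting, here is a minimal sketch in plain Python that mimics a conditional frequency distribution built from (condition, event) pairs. It is only an illustration, not NLTK's implementation: `build_cfd`, `toy_corpus` and `toy_stopwords` are made-up names, and the toy word lists stand in for `brown.words(categories=genre)`.

```python
from collections import Counter, defaultdict

def build_cfd(pairs):
    """Count events per condition from an iterable of (condition, event) pairs."""
    cfd = defaultdict(Counter)
    for condition, event in pairs:
        cfd[condition][event] += 1
    return cfd

# Hypothetical toy data standing in for brown.words(categories=genre)
toy_corpus = {
    "fiction": ["Many", "years", "the", "many", "good"],
    "adventure": ["years", "Years", "the", "bad"],
}
# Hypothetical stand-in for stopwords.words('english')
toy_stopwords = {"the"}

# Lowercase each word and drop stopwords before counting
pairs = (
    (genre, word.lower())
    for genre, words in toy_corpus.items()
    for word in words
    if word.lower() not in toy_stopwords
)
cfd = build_cfd(pairs)

print(cfd["fiction"]["many"])     # 2
print(cfd["adventure"]["years"])  # 2
```

The pairs are consumed directly as tuples from a generator, so no intermediate conversion to lists is involved.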
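Answer 2 (score: 0)
Even without converting cdev_cfd to a list, this still does not work; both test cases fail for me as well. Can someone please help?
Answer 3 (score: 0)
Please try the following code.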
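import nltk
from nltk.corpus import brown, stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop = set(stopwords.words('english'))
    # (genre, lowercased word) pairs with stopwords removed
    temp = [[genre, word.lower()] for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stop]
    cdev_cfd = nltk.ConditionalFreqDist(temp)
    cdev_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
    # reuse the filtered pairs and keep only the 'ing' / 'ed' suffix as the event
    lst = []
    for i in temp:
        if i[1].endswith('ing'):
            lst.append((i[0], 'ing'))
        elif i[1].endswith('ed'):
            lst.append((i[0], 'ed'))
    inged_cfd = nltk.ConditionalFreqDist(lst)
    inged_cfd.tabulate(conditions=cfdconditions, samples=['ed', 'ing'])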
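The second step of this answer, mapping each already-filtered word to its suffix before counting, can be sketched without NLTK as follows. This is a rough illustration on invented data, assuming the stopwords were removed beforehand; `suffix_event`, `suffix_cfd` and the `filtered` list are hypothetical names, not part of NLTK or of the original problem.

```python
from collections import Counter, defaultdict

def suffix_event(word):
    """Return 'ing' or 'ed' if the word ends with that suffix, else None."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            return suffix
    return None

def suffix_cfd(filtered_pairs):
    """Count 'ing' / 'ed' endings per condition from (condition, word) pairs."""
    cfd = defaultdict(Counter)
    for condition, word in filtered_pairs:
        event = suffix_event(word)
        if event is not None:  # words with neither suffix are skipped
            cfd[condition][event] += 1
    return cfd

# Hypothetical, already lowercased and stopword-filtered (genre, word) pairs
filtered = [
    ("fiction", "walked"),
    ("fiction", "running"),
    ("fiction", "jumped"),
    ("adventure", "sailing"),
    ("adventure", "ship"),
]
result = suffix_cfd(filtered)

# Print rows in the caller's order, with 'ed' before 'ing' as in the question
for genre in ["fiction", "adventure"]:
    print(genre, result[genre]["ed"], result[genre]["ing"])
```

Because the input is the same filtered list used for the first table, the corpus only has to be read once.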