I am trying to compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet. This is the function I defined:
def averagePolysemy(synsets):
    allSynsets = list(wn.all_synsets(synsets))
    lemmas = [synset.lemma_names() for synset in allSynsets]
    senseCount = 0
    for lemma in lemmas:
        senseCount = senseCount + len(wn.synsets(lemma, synsets))
    return senseCount/len(allSynsets)

averagePolysemy(wn.NOUN)
When I call it, I get this error:
Traceback (most recent call last):
  File "<ipython-input-214-345e72500ae3>", line 1, in <module>
    averagePolysemy(wn.NOUN)
  File "<ipython-input-213-616cc4af89d1>", line 6, in averagePolysemy
    senseCount = senseCount + len(wn.synsets(lemma, synsets))
  File "/Users/anna/anaconda/lib/python3.6/site-packages/nltk/corpus/reader/wordnet.py", line 1483, in synsets
    lemma = lemma.lower()
AttributeError: 'list' object has no attribute 'lower'
I am not sure which part of my function is causing this error.
Answer 0 (score: 1)
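First, let's define what polysemy is.
Polysemy: the coexistence of many possible meanings for a word or phrase.
(Source: https://www.google.com/search?q=polysemy)
From WordNet:
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.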
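In WordNet, there are a few terms we should be familiar with:
Synset: a distinct concept/meaning
Lemma: the root form of a word
Part of speech (POS): the linguistic category of a word
Word: the surface form of a word (surface words are not in WordNet)
(Note: @alexis has a good answer on lemma vs synset: https://stackoverflow.com/a/42050466/610569; see also https://stackoverflow.com/a/23715743/610569 and https://stackoverflow.com/a/29478711/610569)
In code: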
from nltk.corpus import wordnet as wn

# Given a word "run"
word = 'run'

# We get multiple meanings (i.e. synsets) for
# the word in WordNet.
for synset in wn.synsets(word):
    # Each synset comes with an ID.
    offset = str(synset.offset()).zfill(8)
    # Each meaning comes with its
    # linguistic category (i.e. POS).
    pos = synset.pos()
    # Usually, offset + POS is the way
    # to index a synset.
    idx = offset + '-' + pos
    # Each meaning also comes with its
    # distinct definition.
    definition = synset.definition()
    # For each meaning, there are multiple
    # root words (i.e. lemmas).
    lemmas = ','.join(synset.lemma_names())
    print('\t'.join([idx, definition, lemmas]))
[OUT]:
00189565-n a score in baseball made by a runner touching all four bases safely run,tally
00791078-n the act of testing something test,trial,run
07460104-n a race run on foot footrace,foot_race,run
00309011-n a short trip run
01926311-v move fast by using one's feet, with one foot off the ground at any given time run
02075049-v flee; take to one's heels; cut and run scat,run,scarper,turn_tail,lam,run_away,hightail_it,bunk,head_for_the_hills,take_to_the_woods,escape,fly_the_coop,break_away
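The listing also hints at why the function in the question fails: every synset carries a list of lemma names, so collecting `synset.lemma_names()` per synset produces a list of lists, and handing one of those inner lists to `wn.synsets()` raises the `AttributeError`. Below is a minimal sketch of the flattening step. It uses hand-copied toy data rather than a live WordNet query, and the names `lemma_lists` and `unique_lemmas` are only illustrative.

```python
from itertools import chain

# Toy stand-in for [s.lemma_names() for s in wn.all_synsets('n')]:
# lemma lists copied by hand from the noun rows of the listing above.
lemma_lists = [
    ['run', 'tally'],
    ['test', 'trial', 'run'],
    ['footrace', 'foot_race', 'run'],
    ['run'],
]

# Each element is a list, and a list has no .lower() method,
# which is the method the traceback shows wn.synsets() calling.
print(hasattr(lemma_lists[0], 'lower'))  # False

# Flatten one level and deduplicate so each lemma name appears once.
unique_lemmas = set(chain.from_iterable(lemma_lists))
print(sorted(unique_lemmas))
# ['foot_race', 'footrace', 'run', 'tally', 'test', 'trial']
```

Each string in `unique_lemmas` is the kind of argument `wn.synsets()` expects.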
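Back to the question: how do we "compute the average polysemy of nouns, verbs, adjectives and adverbs" according to WordNet?
Since we are working with WordNet, surface words are out of the picture and we are left with only lemmas.
First, we need to determine which lemmas belong to the noun, verb, adjective and adverb categories.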
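from nltk.corpus import wordnet as wn
from collections import defaultdict

# Map each POS tag to the set of lemma names that occur under it.
words_by_pos = defaultdict(set)

for synset in wn.all_synsets():
    pos = synset.pos()
    # Store lemma names (strings), not Lemma objects, so that they
    # can be passed to wn.synsets() later on.
    for lemma in synset.lemma_names():
        words_by_pos[pos].add(lemma)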
But this is a simplistic view of the relationship between lemmas and POS:
# There are 5 POS.
>>> words_by_pos.keys()
dict_keys(['a', 's', 'r', 'n', 'v'])

# Some words have multiple POS tags =(
>>> len(words_by_pos['n'])
119034
>>> len(words_by_pos['v'])
11531
>>> len(words_by_pos['n'].intersection(words_by_pos['v']))
4062
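To make the overlap concrete, here is a small sketch with toy sets. The sets `toy_nouns` and `toy_verbs` are hypothetical stand-ins for `words_by_pos['n']` and `words_by_pos['v']`, and they contain only a few lemmas copied by hand from the 'run' listing earlier in this answer, so they say nothing about how those lemmas behave in the full database.

```python
# Toy stand-ins for words_by_pos['n'] and words_by_pos['v'], limited to
# lemmas that appear in the 'run' listing earlier in this answer.
toy_nouns = {'run', 'tally', 'test', 'trial', 'footrace', 'foot_race'}
toy_verbs = {'run', 'scat', 'scarper', 'escape'}

# Lemmas that occur under both POS tags.
shared = toy_nouns & toy_verbs
print(shared)  # {'run'}

# Lemmas that occur only as nouns in this toy data.
noun_only = toy_nouns - toy_verbs
print(len(noun_only))  # 5
```

A shared lemma such as 'run' is simply counted once per POS, each time with the senses of that POS only, which is what passing `pos` to `wn.synsets()` does.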
Let's see whether we can ignore that and move on:
# Let's look at the verb 'v' category.
num_meanings_per_verb = []

for word in words_by_pos['v']:
    # No. of meanings for a word given a POS.
    num_meaning = len(wn.synsets(word, pos='v'))
    num_meanings_per_verb.append(num_meaning)

print(sum(num_meanings_per_verb) / len(num_meanings_per_verb))
[OUT]:
2.1866273523545225
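The same computation can be wrapped in a small reusable function. This is only a sketch: `average_polysemy`, `count_senses` and `toy_senses` are hypothetical names, and the sense lookup is passed in as a callable so that the example runs on toy numbers without WordNet installed.

```python
def average_polysemy(lemma_names, count_senses):
    """Return the mean number of senses per distinct lemma name.

    lemma_names  -- iterable of lemma name strings
    count_senses -- callable mapping a lemma name to its number of senses
    """
    names = set(lemma_names)  # count every lemma name once
    if not names:
        raise ValueError('no lemma names given')
    return sum(count_senses(name) for name in names) / len(names)


# Hypothetical toy sense inventory (made-up counts, not WordNet figures).
toy_senses = {'run': 6, 'tally': 2, 'scat': 1}
print(average_polysemy(toy_senses, toy_senses.get))  # 3.0
```

With NLTK loaded, the verb figure would correspond to `average_polysemy(words_by_pos['v'], lambda w: len(wn.synsets(w, pos='v')))`, assuming `words_by_pos` holds lemma name strings.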
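What does this number (about 2.19 senses per verb lemma) mean, if it means anything at all?
It means that, averaged over all verb lemmas in WordNet, a verb has roughly 2.19 senses.
Perhaps that is meaningful, but look at the counts of the values in num_meanings_per_verb: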
Counter({1: 101168,
         2: 11136,
         3: 3384,
         4: 1398,
         5: 747,
         6: 393,
         7: 265,
         8: 139,
         9: 122,
         10: 85,
         11: 74,
         12: 39,
         13: 29,
         14: 10,
         15: 19,
         16: 10,
         17: 6,
         18: 2,
         20: 5,
         26: 1,
         30: 1,
         33: 1})
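A tally like this is dominated by lemmas with a single sense, so the mean alone hides a lot. One possible way to summarize such a distribution is sketched below: it reports the mean over all lemmas, the share of single-sense lemmas, and the mean over lemmas with more than one sense. The function `summarize` and the counts in `toy_counts` are hypothetical and are not taken from WordNet.

```python
from collections import Counter


def summarize(sense_counts):
    """Summarize a Counter mapping 'number of senses' -> 'number of lemmas'."""
    n_lemmas = sum(sense_counts.values())
    if n_lemmas == 0:
        raise ValueError('empty distribution')
    n_senses = sum(k * v for k, v in sense_counts.items())
    n_mono = sense_counts.get(1, 0)   # lemmas with exactly one sense
    n_poly = n_lemmas - n_mono        # lemmas with two or more senses
    return {
        'mean_all': n_senses / n_lemmas,
        'share_monosemous': n_mono / n_lemmas,
        'mean_polysemous_only': (n_senses - n_mono) / n_poly if n_poly else None,
    }


# Made-up toy distribution: 6 lemmas with 1 sense, 2 with 2, and so on.
toy_counts = Counter({1: 6, 2: 2, 3: 1, 9: 1})
print(summarize(toy_counts))
# {'mean_all': 2.2, 'share_monosemous': 0.6, 'mean_polysemous_only': 4.0}
```

Passing `Counter(num_meanings_per_verb)` instead of `toy_counts` would give the same three figures for the real data.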