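This is Python code for computing the type-token ratio (all the definitions are given in the code). I can't get the correct value. I suspect my logic is wrong, but I haven't been able to debug it. I would appreciate any help.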
def type_token_ratio(text):
    """
    (list of str) -> float
    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.
    Return the Type Token Ratio (TTR) for this text. TTR is the number of
    different words divided by the total number of words.
    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n',
        'James Gosling\n']
    >>> type_token_ratio(text)
    0.8888888888888888
    """
    x = 0
    while x < len(text):
        text[x] = text[x].replace('\n', '')
        x = x + 1
    index = 0
    counter = 0
    number_of_words = 0
    words = ' '.join(text)
    words = clean_up(words)
    words = words.replace(',', '')
    lst_of_words = words.split()
    for word1 in lst_of_words:
        while index < len(lst_of_words):
            if word1 == lst_of_words[index]:
                counter = counter + 1
            index = index + 1
    return ((len(lst_of_words) - counter)/len(lst_of_words))
Answer 0 (score: 1)
There is a simpler way: use the collections module:
import collections

def type_token_ratio(text):
    """ (list of str) -> float
    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.
    Return the Type Token Ratio (TTR) for this text. TTR is the number of
    different words divided by the total number of words.
    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n',
        'James Gosling\n']
    >>> type_token_ratio(text)
    0.8888888888888888
    """
    words = " ".join(text).split()  # Give a list of all the words
    counts = collections.Counter(words)  # Map each distinct word to its count
    total = sum(counts.values())  # Total number of words (avoids shadowing the built-in all)
    unique = len(counts)  # Number of distinct words
    return float(unique) / total
Or, as @Yoel points out, there is an even simpler way:
def type_token_ratio(text):
    words = " ".join(text).split()  # Give a list of all the words
    return len(set(words)) / float(len(words))
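Note that both versions above count 'Peter,' and 'Peter' as different words, because they skip the clean-up step from the question. Since the question's clean_up helper is not shown, here is a minimal sketch that uses a hypothetical normalize function in its place (lowercasing and stripping surrounding punctuation is my assumption about what it does, not something the question confirms):

```python
import string


def normalize(word):
    # Hypothetical stand-in for the question's clean_up helper, which is not shown:
    # strip punctuation from both ends of the word and lowercase it.
    return word.strip(string.punctuation).lower()


def type_token_ratio(text):
    # Split every line on whitespace and normalize each token.
    words = [normalize(w) for line in text for w in line.split()]
    # Drop tokens that were pure punctuation and are now empty.
    words = [w for w in words if w]
    return len(set(words)) / float(len(words))


text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'James Gosling\n']
print(type_token_ratio(text))
```

For the docstring example this still gives 8 distinct words out of 9, because 'James' is the only repeat either way.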
Answer 1 (score: 0)
Here is what you probably meant to write (replace your code starting from the for loop).
    init_index = 1
    for word1 in lst_of_words:
        index = init_index  # restart the search just after word1
        while index < len(lst_of_words):
            if word1 == lst_of_words[index]:
                counter = counter + 1
                break  # count this word once, then move on
            index = index + 1
        init_index = init_index + 1
        print(word1)
    print(counter)
    r = float(len(lst_of_words) - counter) / len(lst_of_words)
    print('%.2f' % r)
    return r
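=> index = init_index is the index of the word right after word1; the search always restarts at the next word.
=> break: do not count the same word more than once; at most one match per iteration.
You are searching the rest of the list for a duplicate of the current word (the part of the list before it was already handled by earlier iterations).
Take care not to count multiple occurrences more than once, which is why the break is there. If the same word occurs several times, the next occurrence will be found on the next iteration.
This is based on your code and is not bulletproof.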
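To check this logic in isolation, here is a small self-contained sketch of the same loop wrapped in a function. The name type_token_ratio_loops is made up for illustration, and it takes an already cleaned list of words rather than the raw text:

```python
def type_token_ratio_loops(lst_of_words):
    # Illustrative wrapper (hypothetical name); expects a non-empty list of cleaned words.
    counter = 0      # number of words that occur again later in the list
    init_index = 1   # index of the word right after the current one
    for word1 in lst_of_words:
        index = init_index
        while index < len(lst_of_words):
            if word1 == lst_of_words[index]:
                counter += 1
                break  # one match is enough; later repeats are found in later iterations
            index += 1
        init_index += 1
    return float(len(lst_of_words) - counter) / len(lst_of_words)


words = ['James', 'Fennimore', 'Cooper', 'Peter', 'Paul', 'and', 'Mary',
         'James', 'Gosling']
# The nested loops should agree with the set-based count of distinct words.
print(type_token_ratio_loops(words) == len(set(words)) / float(len(words)))
```

Every word except its last occurrence is counted exactly once, so len(lst_of_words) - counter equals the number of distinct words. The nested loops are quadratic in the number of words, so for long texts the set-based version is the better choice.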