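NLTK: counting the frequency of a sub-phrase

Time: 2015-08-06 07:48:25

Tags: python nltk

For this sentence: "I see a tall tree outside. A man is under the tall tree"

How do I count the frequency of "tall tree"? I can use bigrams as in collocations, for example: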

import nltk

# tokens is the tokenized sentence, e.g. the output of nltk.word_tokenize()
bgs = nltk.bigrams(tokens)
fdist1 = nltk.FreqDist(bgs)
pairs = fdist1.most_common(500)

But I only need the count for one specific sub-phrase.

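2 answers:

Answer 0 (score: 2)

@uday1889's answer has some flaws: str.count() matches raw substrings, so it also counts hits inside longer words such as "stall", "trees" or "install treehouses":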

>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> string.count("tall tree")
2
>>> string = "The see a stall tree outside. A man is under the tall trees"
>>> string.count("tall tree")
2
>>> string = "I would like to install treehouses at my yard"
>>> string.count("tall tree")
1

A cheap hack is to pad the substring passed to str.count() with spaces:

>>> string = "I would like to install treehouses at my yard"
>>> string.count("tall tree")
1
>>> string.count(" tall tree ")
0
>>> string = "The see a stall tree outside. A man is under the tall trees"
>>> string.count(" tall tree ")
0
>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> string.count(" tall tree ")
1

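But as you can see, this breaks when the phrase sits at the start or end of the string, or next to punctuation. Tokenizing the text and comparing bigrams avoids both problems: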

>>> from nltk.util import ngrams
>>> from nltk import word_tokenize
>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> len([i for i in ngrams(word_tokenize(string),n=2) if i==('tall', 'tree')])
2
>>> string = "I would like to install treehouses at my yard"
>>> len([i for i in ngrams(word_tokenize(string),n=2) if i==('tall', 'tree')])
0
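
The same idea extends to phrases of any length. The following is only a minimal sketch: count_phrase and simple_tokenize are hypothetical names, not part of NLTK, and the regex tokenizer is an assumed stand-in so the example runs without NLTK's tokenizer data. Swapping in word_tokenize should give results closer to the transcript above.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def count_phrase(text, phrase):
    """Count occurrences of `phrase` as a run of whole tokens in `text` (case-sensitive)."""
    tokens = simple_tokenize(text)
    target = tuple(simple_tokenize(phrase))
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of n tokens across the text and compare it with the target.
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == target
    )


print(count_phrase("I see a tall tree outside. A man is under the tall tree", "tall tree"))
print(count_phrase("I would like to install treehouses at my yard", "tall tree"))
```

The comparison is case-sensitive, so "Tall tree" at the start of a sentence would not be counted unless both the text and the phrase are lowercased first.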

Answer 1 (score: 1)

The count() method should do this:

string = "I see a tall tree outside. A man is under the tall tree"
string.count("tall tree")
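
If matches inside longer words are a concern, one standard-library variant of the same idea is a regular expression with word boundaries. This is a sketch, not part of the original answer; count_whole_phrase is a hypothetical helper name, and it assumes the phrase starts and ends with a word character so that \b behaves as intended.

```python
import re


def count_whole_phrase(text, phrase):
    """Count non-overlapping occurrences of `phrase` bounded by word boundaries."""
    # re.escape guards against regex metacharacters in the phrase;
    # \b rejects hits inside longer words such as "stall" or "trees".
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text))


print(count_whole_phrase("I see a tall tree outside. A man is under the tall tree", "tall tree"))
print(count_whole_phrase("I would like to install treehouses at my yard", "tall tree"))
```

Unlike the space-padding trick, the word boundary also matches at the start and end of the string and next to punctuation.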