Question

I would like to know how to use Counter() to count unigram, bigram, cooc and wordcount from the list training_data.

I'm new to Python, so please bear with me. Thanks!

You need to implement two parts of the HMM postagger.

HMM model

Viterbi decoding

Here is the code:

from collections import Counter
from math import log

class HMM(object):
    def __init__(self, epsilon=1e-5, training_data=None):
        self.epsilon = epsilon
        if training_data is not None:
            self.fit(training_data)
    def fit(self, training_data):
        '''
        Counting the number of unigram, bigram, cooc and wordcount from the training
        data.

        Parameters
        ----------
        training_data: list
            A list of training data, each element is a tuple with words and postags.
        '''
        self.unigram = Counter()    # The count of postag unigram, e.g. unigram['NN']=5
        self.bigram = Counter()     # The count of postag bigram, e.g. bigram[('PRP', 'VV')]=1
        self.cooc = Counter()       # The count of word, postag, e.g. cooc[('I', 'PRP')]=1
        self.wordcount = Counter()  # The count of word, e.g. wordcount['I']=1

        print('building HMM model ...')
        for words, tags in training_data:
            # Your code here! You need to implement the ngram counting part. Please count
            # - unigram
            # - bigram
            # - cooc
            # - wordcount

        print('HMM model is built.')
        self.postags = [k for k in self.unigram]

Here is the training_dataset, together with the expected results:

    # The tiny example.
    training_dataset = [(['dog', 'chase', 'cat'], ['NN', 'VV', 'NN']),
                (['I', 'chase', 'dog'], ['PRP', 'VV', 'NN']),
                (['cat', 'chase', 'mouse'], ['NN', 'VV', 'NN'])
               ]

    hmm = HMM(training_data=training_dataset)

    # Test that the parameters are correctly estimated.
    assert hmm.unigram['NN'] == 5
    assert hmm.bigram['VV', 'NN'] == 3
    assert hmm.bigram['NN', 'VV'] == 2
    assert hmm.cooc['dog', 'NN'] == 2

Answer 1

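Using Counter() with lists is straightforward. Counter.update() is exactly what you need: it accepts any iterable and adds one to the count of each element it yields.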

from nltk.util import bigrams

...

        for words, tags in training_data:
            self.unigram.update(tags)
            self.bigram.update(bigrams(tags))
            self.cooc.update(zip(words, tags))
            self.wordcount.update(words)
...
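
For anyone who prefers to avoid the nltk dependency, here is a minimal standalone sketch of the same counting using only the standard library. It assumes that zip(tags, tags[1:]) is an acceptable substitute for nltk's bigrams(), since both yield adjacent pairs of tags. The function name count_ngrams is made up for illustration and is not part of the original class.

```python
from collections import Counter


def count_ngrams(training_data):
    """Count tag unigrams, tag bigrams, (word, tag) pairs and words.

    Hypothetical helper: mirrors the loop body of HMM.fit without the class.
    """
    unigram, bigram, cooc, wordcount = Counter(), Counter(), Counter(), Counter()
    for words, tags in training_data:
        unigram.update(tags)
        # Pair each tag with its successor; bigrams never span two sentences.
        bigram.update(zip(tags, tags[1:]))
        # Pair each word with the tag at the same position.
        cooc.update(zip(words, tags))
        wordcount.update(words)
    return unigram, bigram, cooc, wordcount


# The tiny example from the question.
training_dataset = [
    (['dog', 'chase', 'cat'], ['NN', 'VV', 'NN']),
    (['I', 'chase', 'dog'], ['PRP', 'VV', 'NN']),
    (['cat', 'chase', 'mouse'], ['NN', 'VV', 'NN']),
]
unigram, bigram, cooc, wordcount = count_ngrams(training_dataset)
```

Because Counter.update() accumulates across calls, running it once per sentence gives totals over the whole dataset. The counts from this sketch satisfy the four assertions in the question.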
