How do I use Counter() to count unigram, bigram, cooc and wordcount from a training_data list?

Time: 2015-07-28 03:38:30

Tags: python nlp

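I would like to know how to use Counter() to count unigram, bigram, cooc and wordcount from the list training_data.

I am new to Python, so please bear with me. Thanks!

The assignment requires implementing two parts of an HMM POS tagger:

  1. The HMM model
  2. Viterbi decoding

Here is the code: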

    from collections import Counter
    from math import log
    
    class HMM(object):
        def __init__(self, epsilon=1e-5, training_data=None):
            self.epsilon = epsilon
            if training_data is not None:
                self.fit(training_data)

        def fit(self, training_data):
            '''
            Counting the number of unigram, bigram, cooc and wordcount from the training
            data.

            Parameters
            ----------
            training_data: list
                A list of training data, each element is a tuple with words and postags.
            '''
            self.unigram = Counter()    # The count of postag unigram, e.g. unigram['NN']=5
            self.bigram = Counter()     # The count of postag bigram, e.g. bigram[('PRP', 'VV')]=1
            self.cooc = Counter()       # The count of word, postag, e.g. cooc[('I', 'PRP')]=1
            self.wordcount = Counter()  # The count of word, e.g. wordcount['I']=1

            print('building HMM model ...')
            for words, tags in training_data:
                # Your code here! You need to implement the ngram counting part. Please count
                # - unigram
                # - bigram
                # - cooc
                # - wordcount
                pass  # placeholder so the skeleton parses until the counting is implemented

            print('HMM model is built.')
            self.postags = [k for k in self.unigram]
    
Here is the training dataset, together with the expected results:

        # The tiny example.
        training_dataset = [(['dog', 'chase', 'cat'], ['NN', 'VV', 'NN']),
                            (['I', 'chase', 'dog'], ['PRP', 'VV', 'NN']),
                            (['cat', 'chase', 'mouse'], ['NN', 'VV', 'NN'])]
    
        hmm = HMM(training_data=training_dataset)
    
        # Test whether the parameters are correctly estimated.
        assert hmm.unigram['NN'] == 5
        assert hmm.bigram['VV', 'NN'] == 3
        assert hmm.bigram['NN', 'VV'] == 2
        assert hmm.cooc['dog', 'NN'] == 2
    

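1 Answer:

Answer 0: (score: 0)

Using Counter() with lists is straightforward. Counter.update() is exactly what you need: it takes an iterable and adds one to the count of each element it yields.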
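
As a quick illustration of how update() accumulates counts, here is a minimal standalone sketch. The tag and word lists are made up for the example and are not taken from the question's data.

```python
from collections import Counter

tags = ['NN', 'VV', 'NN']  # made-up example tag sequence

counts = Counter()
counts.update(tags)      # every element of the iterable is counted once
counts.update(['NN'])    # later calls add to the existing counts

pair_counts = Counter()
# Tuples are hashable, so (word, tag) pairs can be used as keys directly.
pair_counts.update(zip(['dog', 'chase'], ['NN', 'VV']))

print(counts['NN'])                # 3
print(pair_counts[('dog', 'NN')])  # 1
print(counts['JJ'])                # 0, missing keys count as zero
```

The same call can be applied to each of the four counters inside the training loop: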

    from nltk.util import bigrams

    ...

    for words, tags in training_data:
        self.unigram.update(tags)            # postag unigram counts
        self.bigram.update(bigrams(tags))    # consecutive postag pairs
        self.cooc.update(zip(words, tags))   # (word, postag) pairs
        self.wordcount.update(words)         # word counts
    ...
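
If NLTK is not available, the same counting can probably be done with the standard library alone. The sketch below is one possible variant, not the answerer's code: `count_ngrams` is a hypothetical helper name, and `zip(tags, tags[1:])` stands in for `bigrams(tags)`, which assumes `tags` is a list (or another sliceable sequence). No sentence-boundary markers are added, since the question's expected values do not require them.

```python
from collections import Counter


def count_ngrams(training_data):
    """Return (unigram, bigram, cooc, wordcount) Counters for the data.

    Hypothetical helper: it mirrors the loop body of HMM.fit from the
    question, but as a free function so it can be run on its own.
    """
    unigram, bigram, cooc, wordcount = Counter(), Counter(), Counter(), Counter()
    for words, tags in training_data:
        unigram.update(tags)                # postag counts
        bigram.update(zip(tags, tags[1:]))  # adjacent postag pairs
        cooc.update(zip(words, tags))       # (word, postag) pairs
        wordcount.update(words)             # word counts
    return unigram, bigram, cooc, wordcount


# The tiny example from the question.
training_dataset = [
    (['dog', 'chase', 'cat'], ['NN', 'VV', 'NN']),
    (['I', 'chase', 'dog'], ['PRP', 'VV', 'NN']),
    (['cat', 'chase', 'mouse'], ['NN', 'VV', 'NN']),
]

unigram, bigram, cooc, wordcount = count_ngrams(training_dataset)

# The same checks as in the question.
assert unigram['NN'] == 5
assert bigram['VV', 'NN'] == 3
assert bigram['NN', 'VV'] == 2
assert cooc['dog', 'NN'] == 2
```

Inside the class, the four `update()` calls would go in the `for` loop of `fit`, with `self.` in front of each counter name.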