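Cumulative frequencies, Ngrams

Time: 2012-10-26 12:07:48

Tags: python regex nltk

Quick question: if you run the code below, you get a list of bigram frequencies for each sentence list in the corpus.

I would like to display and keep track of a running total instead. That is, rather than the frequencies of 1 or 2 you see when you run it (because each individual sentence is so small), it should count across the entire corpus and display those frequencies.

I then essentially need to generate text from those frequencies that models the original corpus.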

#---------------------------------------------------------
#!/usr/bin/env python
#Ngram Project

#Import all of the libraries we will need for the program to function
import nltk
import nltk.collocations
from collections import defaultdict
import nltk.corpus as corpus
from nltk.corpus import brown

#---------------------------------------------------------

#create our list with the Brown corpus inside variable called "news"
news = corpus.brown.sents(categories = 'editorial')
#This will display the type of variable Python recognizes this as
print "News Is Of The Variable Type : ",type(news),'\n'

#---------------------------------------------------------


#This function takes in the corpus one sentence (a list of tokens) at a time
#It adds <s> to the beginning of the sentence and replaces a final period with </s>
def alter_list(corpus_list):
    #Work on a copy so the sentence stored in the corpus is not modified
    sentence = list(corpus_list)
    #If the sentence ends with a period, replace it with '</s>'
    if sentence and sentence[-1] == '.':
        sentence[-1] = '</s>'
    #Otherwise append '</s>' to the end of the sentence
    else:
        sentence.append('</s>')
    return ['<s>'] + sentence

#Displays the length of the list 'news'
print "The Length of News is : ",len(news),'\n'
#Allows the user to choose how much of the annotated corpus they would like to see
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n'
user = input()
#Takes user input to determine how many lines to display if any
if(user >= 1):
    print "The Corpus Annotated with <s> and </s> looks like : "
    print "Displaying [",user,"] rows of the corpus : ", '\n' 
    for corpus_list in news[:user]:
        print alter_list(corpus_list),'\n'
#Non positive number catch
else:
    print "Fine I Won't Show You Any... ",'\n'

#---------------------------------------------------------

print '\n'
#Again allows the user to choose the number of lists from Brown corpus to be displayed in
# Unigram, bigram, trigram and quadgram format
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ")
count = 0

#Function 'ngrams' is run in a loop so that each entry in the list can be gone through and turned into information
#Displayed to the user
while(count < user2):
    passer = news[count]

    def ngrams(passer, n = 2, padding = True):
        #Padding adds n-1 'None' items to both ends of the sentence so that
        #the first and last words also appear in a full set of n-grams
        pad = [] if not padding else [None]*(n-1)
        grams = pad + passer + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

    #In this case, arguments are first: n-gram type (bi, tri, quad)
    #Followed by in our case the addition of 'padding'
    #Padding is used in every case here because we need it for calculations
    #This function structure allows us to pull in corpus parts without the added annotations if need be
    for size, padding in ((1,1), (2,1), (3, 1), (4, 1)):
        print '\n%d - grams || padding = %d' % (size, padding)
        print list(ngrams(passer, size, padding))

    # show frequency
    counts = defaultdict(int)
    for n_gram in ngrams(passer, 2, False):
        counts[n_gram] += 1

    print ("======================================================================================")
    print '\nFrequencies Of Bigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram

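    print '\nFrequencies Of Trigrams:'
    #Count trigrams separately; 'counts' above only holds bigrams
    tri_counts = defaultdict(int)
    for n_gram in ngrams(passer, 3, False):
        tri_counts[n_gram] += 1
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in tri_counts.iteritems()), reverse = True):
        print c, n_gram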

    count = count + 1

#---------------------------------------------------------

2 Answers:

Answer 0: (score: 1)

I'm not sure I understand the question. nltk has a generate function. The book that nltk comes from is available online.

http://nltk.org/book/ch01.html

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)

>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she

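Answer 1: (score: 1)

The problem is that you redefine the dict counts for every sentence, so the ngram counts are reset to zero each time. Define it above the while loop and the counts will accumulate over the entire Brown corpus (or over as many sentences as the loop processes).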
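A minimal sketch of that fix, written in Python 3 syntax (the question's code is Python 2) and using only the standard library. The three sample sentences and the simplified ngrams helper are assumptions for illustration; in the question's program the sentences would come from news instead.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return the n-grams of a token list as tuples (no padding)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stand-in for a few tokenized, annotated corpus sentences.
sentences = [
    ['<s>', 'the', 'cat', 'sat', '</s>'],
    ['<s>', 'the', 'cat', 'ran', '</s>'],
    ['<s>', 'the', 'dog', 'sat', '</s>'],
]

# Created once, before the loop, so it is never reset between sentences.
counts = Counter()
for sentence in sentences:
    counts.update(ngrams(sentence, 2))

# Print the running totals, most frequent bigrams first.
for n_gram, c in counts.most_common():
    print(c, n_gram)
```

Because counts lives outside the loop, a bigram such as ('<s>', 'the') is tallied across all three sentences rather than once per sentence.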

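Bonus advice: you should also move the definition of ngrams outside the loop; there is no point in defining the same function over and over. (It does no harm apart from performance, though.) Better still, you should use nltk's ngram function and read up on FreqDist, which is like a dictionary counter on steroids. It will come in handy when you get to statistical text generation.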
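For the text generation step, one possible approach is to sample each next word in proportion to how often it followed the previous word. The sketch below is in Python 3 syntax and assumes a plain bigram model; collections.Counter stands in for FreqDist so that it runs without NLTK, and build_model, generate, and the sample sentences are hypothetical names and data, not part of nltk.

```python
import random
from collections import Counter, defaultdict

def build_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = ['<s>'] + list(sentence) + ['</s>']
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def generate(model, rng, max_words=20):
    """Sample words in proportion to bigram counts until '</s>' is drawn."""
    word, output = '<s>', []
    while len(output) < max_words:
        followers = model[word]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == '</s>':
            break
        output.append(word)
    return output

# Hypothetical stand-in for tokenized corpus sentences.
sentences = [
    ['the', 'cat', 'sat'],
    ['the', 'cat', 'ran'],
    ['the', 'dog', 'sat'],
]
model = build_model(sentences)
print(' '.join(generate(model, random.Random(0))))
```

Starting from '<s>' and stopping at '</s>' is what makes the sentence boundary markers from the question useful: the model learns which words tend to open and close a sentence. Passing a seeded random.Random makes a run repeatable.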