使用Python计算N Grams

时间:2012-11-16 20:26:35

标签: python nlp nltk n-gram

我需要为包含以下文字的文本文件计算Unigrams,BiGrams和Trigrams:

“仅在美国,囊性纤维化影响了30,000名儿童和年轻人 吸入盐水雾可以减少充满囊性纤维化患者呼吸道的脓液和感染,尽管副作用包括令人讨厌的咳嗽和严酷的味道。 这是本周出版的“新英格兰医学杂志”上发表的两项研究的结论。“

我从Python开始并使用以下代码:

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N,text):
NList = []                      # start with an empty list
if N> 1:
    space = " " * (N-1)         # add N - 1 spaces
    text = space + text + space # add both in front and back
# append the slices [i:i+N] to NList
for i in range( len(text) - (N - 1) ):
    NList.append(text[i:i+N])
return NList                    # return the list
# test code
for i in range(5):
print N_Gram(i+1,"text")
# more test code
nList = N_Gram(7,"Here is a lot of text to print")
for ngram in iter(nList):
print '"' + ngram + '"'

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

但它适用于一个单词中的所有n-gram,当我想要它来自CYSTIC和FIBROSIS或CYSTIC FIBROSIS之间的单词之间。有人可以帮我解决这个问题吗?

8 个答案:

答案 0 :(得分:32)

来自此blog的简短Pythonesque解决方案:

def find_ngrams(input_list, n):
  return zip(*[input_list[i:] for i in range(n)])

用法:

>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3))
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]

答案 1 :(得分:28)

假设输入是一个字符串,包含空格分隔的单词,如x = "a b c d",您可以使用以下函数(编辑:请参阅最后一个函数以获得更完整的解决方案):

def ngrams(input, n):
  input = input.split(' ')
  output = []
  for i in range(len(input)-n+1):
    output.append(input[i:i+n])
  return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

如果您希望将这些联接回字符串,您可以调用类似的内容:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

最后,这并没有将事情总结为总数,因此如果您的输入为'a a a a',则需要将它们计算为dict:

for g in (' '.join(x) for x in ngrams(input, 2)):
   grams.setdefault(g, 0)
   grams[g] += 1

将所有这些组合成一个最终函数给出:

def ngrams(input, n):
  input = input.split(' ')
  output = {}
  for i in range(len(input)-n+1):
    g = ' '.join(input[i:i+n])
    output.setdefault(g, 0)
    output[g] += 1
  return output

ngrams('a a a a', 2) # {'a a': 3}

答案 2 :(得分:25)

使用NLTK(自然语言工具包)并使用函数将文本标记(拆分)到列表中,然后找到bigrams和trigrams。

import nltk
words = nltk.word_tokenize(my_text)
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)

答案 3 :(得分:8)

还有一个名为Scikit的python模块。这是代码。这将有助于您获得在特定范围内给出的所有克数。这是代码

from sklearn.feature_extraction.text import CountVectorizer 
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))
analyzer = vectorizer.build_analyzer()
print analyzer(text)

输出

[u'this', u'is', u'foo', u'bar', u'sentences', u'and', u'want', u'to', u'ngramize', u'it', u'this is', u'is foo', u'foo bar', u'bar sentences', u'sentences and', u'and want', u'want to', u'to ngramize', u'ngramize it', u'this is foo', u'is foo bar', u'foo bar sentences', u'bar sentences and', u'sentences and want', u'and want to', u'want to ngramize', u'to ngramize it', u'this is foo bar', u'is foo bar sentences', u'foo bar sentences and', u'bar sentences and want', u'sentences and want to', u'and want to ngramize', u'want to ngramize it', u'this is foo bar sentences', u'is foo bar sentences and', u'foo bar sentences and want', u'bar sentences and want to', u'sentences and want to ngramize', u'and want to ngramize it', u'this is foo bar sentences and', u'is foo bar sentences and want', u'foo bar sentences and want to', u'bar sentences and want to ngramize', u'sentences and want to ngramize it']

这里给出了1到6范围内给出的所有克数。它使用了名为countVectorizer的方法。这是link

答案 4 :(得分:3)

使用collections.deque

from collections import deque
from itertools import islice

def ngrams(message, n=1):
    it = iter(message.split())
    window = deque(islice(it, n), maxlen=n)
    yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)

......或者你可以在一行中作为列表理解来做:

n = 2
message = "Hello, how are you?".split()
myNgrams = [message[i:i+n] for i in range(len(message) - n)]

答案 5 :(得分:1)

nltk原生支持ngrams

'n'是ngram大小ex:n = 3用于trigram

from nltk import ngrams

def ngramize(texts, n):
    output=[]
    for text in texts:
        output += ngrams(text,n)
    return output

答案 6 :(得分:1)

虽然帖子很旧,但我想在这里提一下我的答案,以便大多数ngram创建逻辑可以在一个帖子中。

Python中有一些名称为TextBlob的东西。它很容易创建类似于NLTK的ngram。

以下是带有输出的代码段,以便于理解。

sent = """This is to show the usage of Text Blob in Python"""
blob = TextBlob(sent)
unigrams = blob.ngrams(n=1)
bigrams = blob.ngrams(n=2)
trigrams = blob.ngrams(n=3)

输出是:

unigrams
[WordList(['This']),
 WordList(['is']),
 WordList(['to']),
 WordList(['show']),
 WordList(['the']),
 WordList(['usage']),
 WordList(['of']),
 WordList(['Text']),
 WordList(['Blob']),
 WordList(['in']),
 WordList(['Python'])]

bigrams
[WordList(['This', 'is']),
 WordList(['is', 'to']),
 WordList(['to', 'show']),
 WordList(['show', 'the']),
 WordList(['the', 'usage']),
 WordList(['usage', 'of']),
 WordList(['of', 'Text']),
 WordList(['Text', 'Blob']),
 WordList(['Blob', 'in']),
 WordList(['in', 'Python'])]

trigrams
[WordList(['This', 'is', 'to']),
 WordList(['is', 'to', 'show']),
 WordList(['to', 'show', 'the']),
 WordList(['show', 'the', 'usage']),
 WordList(['the', 'usage', 'of']),
 WordList(['usage', 'of', 'Text']),
 WordList(['of', 'Text', 'Blob']),
 WordList(['Text', 'Blob', 'in']),
 WordList(['Blob', 'in', 'Python'])]

就这么简单。

TextBlob还有更多内容。有关详细信息,请查看此文档 - https://textblob.readthedocs.io/en/dev/

答案 7 :(得分:1)

如果效率是一个问题,你必须建立多个不同的n-gram,我会考虑使用以下代码(建立在Franck的优秀答案上):

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

用法:

>>> input_list = input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

〜与NLTK相同的速度:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)