NLTK sentence tokenization with words as sentence units

Asked: 2015-07-22 12:50:10

Tags: python nlp nltk

I would like Python to store words, rather than characters, as the basic units of a sentence.

import nltk
from nltk.tokenize import RegexpTokenizer

# Punkt model splits text into sentences; the regex tokenizer
# extracts runs of word characters, dropping punctuation.
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
word_tokenizer = RegexpTokenizer(r'\w+')

my_text = 'WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about ''the pain of a broken trust'' that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement. Frankly, the onus is on law enforcement because we are the ones who have taken the oath to protect and to serve the people of this city,'' Ms. Lynch said in 2000.'

len(my_text)
Out[129]: 498

my_sents = sent_tokenizer.tokenize(my_text)

len(my_sents)
Out[132]: 2

But if I ask for the length of the first sentence, it gives its length in characters:

len(my_sents[0])
Out[133]: 337

I can get the individual words by tokenizing the text (but then they are no longer structured into sentences):

my_words = word_tokenizer.tokenize(str(my_sents))
len(my_words)
Out[140]: 86

But can the words be stored within a sentence structure? For example:

print 'The sentence has ', len(my_sents[0]), ' words'
The sentence has 64 words
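
One way to get that structure (a minimal sketch, reusing the sent_tokenizer and word_tokenizer defined above) is to tokenize each detected sentence separately, giving a list of sentences where each sentence is a list of words:

# One inner list of words per sentence.
my_sent_words = [word_tokenizer.tokenize(sent) for sent in my_sents]

print 'The sentence has ', len(my_sent_words[0]), ' words'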

1 Answer:

Answer 0 (score: 0):

import nltk
nltk.word_tokenize("Tokenize this!")

Result:

['Tokenize', 'this', '!']

Is this what you are after?
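
Along the same lines, the sentence structure asked about in the question can be kept by combining nltk.sent_tokenize with nltk.word_tokenize into a nested list (a sketch; note that word_tokenize, unlike the RegexpTokenizer(r'\w+') above, keeps punctuation marks such as '!' as separate tokens, so the counts will differ):

import nltk

# One list of word tokens per detected sentence.
my_sents_words = [nltk.word_tokenize(sent)
                  for sent in nltk.sent_tokenize(my_text)]

len(my_sents_words)     # number of sentences
len(my_sents_words[0])  # tokens (words and punctuation) in the first sentence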