Python: NLTK regular expression tokenizer produces empty output

Time: 2019-01-08 07:34:54

Tags: python nlp nltk

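I am trying to tokenize the sample text from the NLTK book (using Python 2.7), but the output does not match what I expect. Is there something I am missing?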

import nltk

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)     # set flag to allow verbose regexps
   ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(-\w+)*        # words with optional internal hyphens
   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.            # ellipsis
   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
   '''

nltk.regexp_tokenize(text, pattern)


Output: 
 [('', '', ''),
 ('A.', '', ''),
 ('', '-print', ''),
 ('', '', ''),
 ('', '', '.40'),
 ('', '', '')]

Expected:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
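
The tuples in the output are consistent with how Python's `re.findall` behaves when a pattern contains capturing groups: it returns the group contents instead of the whole match. Assuming the installed NLTK is a 3.x release whose `regexp_tokenize` relies on `findall` (an assumption, not confirmed in the question), rewriting each `(...)` as a non-capturing `(?:...)` is a likely fix. The sketch below uses only the standard `re` module so that it runs without NLTK; the variable names `capturing` and `non_capturing` are illustrative.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Pattern from the question: every (...) is a capturing group.
capturing = r'''(?x)     # set flag to allow verbose regexps
   ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(-\w+)*        # words with optional internal hyphens
   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.            # ellipsis
   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
   '''

# Same pattern with (?:...) so that findall returns whole matches.
non_capturing = r'''(?x)   # set flag to allow verbose regexps
   (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*        # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.              # ellipsis
   | [][.,;"'?():-_`]    # these are separate tokens; includes ], [
   '''

# With capturing groups, findall yields one tuple of group contents per match.
group_tuples = re.findall(capturing, text)

# With non-capturing groups, findall yields the matched strings themselves.
tokens = re.findall(non_capturing, text)

print(group_tuples)
print(tokens)
```

If the assumption about NLTK holds, passing `non_capturing` to `nltk.regexp_tokenize(text, non_capturing)` should give the same token list as the second `findall` call.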

0 answers:

No answers yet