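Extracting only words in Python

Time: 2017-05-01 21:57:55

Tags: python nltk

So far I have been using this function to extract only the valid words from plain English strings and Unicode strings: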

s = """\"A must-read for the business leader of today and tomorrow."--John G. O'Neill, Vice President, 3M Canada. High Performance Sales Organizations defined the true nature of market-focused sales and service operations, and helped push sales organizations into the 21st century"""
t = 'Life is life (I want chocolate);&'
w = u'Tú te llamabas de niña Concepción Morales!!'

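import re

def clean_words(text, separator=' '):
  # Python 2 code: `unicode` is the Python 2 text type.
  if isinstance(text, unicode):
    # Unicode input: keep runs of word characters, including accented letters.
    return separator.join(re.findall(r'[\w]+', text, re.U)).rstrip()
  else:
    # Byte strings: collapse runs of non-word characters into the separator.
    return re.sub(r'\W+', ' ', text).replace(' ', separator).rstrip()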

It seems to have a problem with surnames and apostrophes. For s it returns:

 A must read for the business leader of today and tomorrow John G O Neill Vice President 3M Canada High Performance Sales Organizations defined the true nature of market focused sales and service operations and helped push sales organizations into the 21st century

When I tokenize this output, I get single-character tokens.

Any suggestions?

2 answers:

Answer 0: (score: 1)

It looks like the Treebank tokenizer is what you want:

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)
#['``', 'A', 'must-read', 'for', 'the', 'business', 'leader', 'of',
# 'today', 'and', 'tomorrow.', "''", '--', 'John', 'G.', "O'Neill",
# ',', 'Vice', 'President', ',', '3M', 'Canada.', 'High', 
# 'Performance', 'Sales', 'Organizations', 'defined', 'the', 'true', 
# 'nature', 'of', 'market-focused', 'sales', 'and', 'service', 
# 'operations', ',', 'and', 'helped', 'push', 'sales', 
# 'organizations', 'into', 'the', '21st', 'century']
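
Since the question asks for words only, the punctuation tokens in that list may still need to be dropped. A minimal sketch, assuming that a "word" is any token containing at least one letter or digit; the helper name keep_word_tokens is hypothetical and not part of NLTK:

```python
def keep_word_tokens(tokens):
    """Drop tokens that contain no letter or digit (quotes, dashes, commas)."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# A few tokens copied from the tokenizer output above.
sample = ['``', 'A', 'must-read', "''", '--', 'John', 'G.', "O'Neill", ',', '3M']
print(keep_word_tokens(sample))
```

Tokens such as 'G.' keep their trailing period under this rule; stripping it would be a separate decision.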

Answer 1: (score: 1)

Alternatively, you can use spaCy:

import spacy
nlp = spacy.load('en')  # on spaCy 3+, load an installed pipeline by name instead, e.g. 'en_core_web_sm'
s_tokenized = [t.text for t in nlp(s)]

# ['"', 'A', 'must', '-', 'read', 'for', 'the', 'business', 'leader', 'of',
#  'today', 'and', 'tomorrow', '."--', 'John', 'G.', "O'Neill", ',', 'Vice',
#  'President', ',', '3', 'M', 'Canada', '.', 'High', 'Performance', 'Sales',
#  'Organizations', 'defined', 'the', 'true', 'nature', 'of', 'market', '-',
#  'focused', 'sales', 'and', 'service', 'operations', ',', 'and', 'helped',
#  'push', 'sales', 'organizations', 'into', 'the', '21st', 'century']
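
If neither library is an option, a regex-only variant of the original function might be enough. The sketch below is written for Python 3 and is not taken from either answer; the pattern and the name extract_words are assumptions. It treats an apostrophe or hyphen as part of a word only when it sits between word characters, so O'Neill and must-read stay whole:

```python
import re

# Assumed pattern: a run of word characters, optionally followed by further
# runs joined by a single apostrophe or hyphen. In Python 3, \w already
# matches accented letters in str input, so no re.UNICODE flag is needed.
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")

def extract_words(text, separator=' '):
    """Return the words found in text, joined by separator."""
    return separator.join(WORD_RE.findall(text))

print(extract_words("John G. O'Neill, Vice President, 3M Canada."))
print(extract_words('Tú te llamabas de niña Concepción Morales!!'))
```

Unlike the tokenizers, this discards sentence structure entirely, so it suits cases where only the bag of words matters.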