I have a string in my table:
'I am going to visit "Huge Hotel" and the "Grand River"'
I want to tokenize it as
['I', 'am', 'going', ..., 'Huge Hotel', 'and', 'the', 'Grand River']
As you can see, 'Huge Hotel' and 'Grand River' are each treated as a single token because they appear inside double quotes.
import nltk
text = 'I am going to visit "Huge Hotel" and the "Grand River"'
b = nltk.word_tokenize(text)
I have written the code above, but it does not give the result I want.
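Answer 0 (score: 1)
It looks odd, but it works:
1. re.findall('"([^"]*)"', s): find all substrings enclosed in double quotes.
2. phrase.replace(' ', '_'): replace the spaces in each substring from step 1 with underscores.
3. word_tokenize(): tokenize the resulting string.
[OUT]: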
>>> import re
>>> from nltk import word_tokenize
>>> s = 'I am going to visit "Huge Hotel" and the "Grand River"'
>>> for phrase in re.findall('"([^"]*)"', s):
... s = s.replace('"{}"'.format(phrase), phrase.replace(' ', '_'))
...
>>> s
'I am going to visit Huge_Hotel and the Grand_River'
>>> word_tokenize(s)
['I', 'am', 'going', 'to', 'visit', 'Huge_Hotel', 'and', 'the', 'Grand_River']
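The question asked for 'Huge Hotel' with a space rather than 'Huge_Hotel'. As a minimal sketch, assuming the original text contains no underscores of its own, the placeholder underscores can be turned back into spaces after tokenizing. The name restored is illustrative only.

```python
# Tokens as printed in the session above.
tokens = ['I', 'am', 'going', 'to', 'visit', 'Huge_Hotel', 'and', 'the', 'Grand_River']

# Turn the placeholder underscores back into spaces.
# Caveat: this also rewrites any underscore that was already in the input text.
restored = [tok.replace('_', ' ') for tok in tokens]
print(restored)
```

If the input may contain real underscores, a placeholder that cannot occur in the text would be a safer choice than '_'.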
I am sure there is a simpler single regex operation that could replace this sequence of regex and string operations.
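One possible single-pass version, offered as a sketch rather than a drop-in replacement: a regex with two alternatives that matches either a double-quoted phrase or a run of non-whitespace characters. It uses only the standard re module, so unlike word_tokenize it does not split off punctuation. The names TOKEN_RE and tokenize_keep_quoted are hypothetical and not part of NLTK.

```python
import re

# Either a double-quoted phrase (captured without the quotes)
# or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')

def tokenize_keep_quoted(text):
    """Split on whitespace, keeping each double-quoted phrase as one token."""
    tokens = []
    for quoted, plain in TOKEN_RE.findall(text):
        # Exactly one of the two groups is non-empty for a normal match.
        tokens.append(quoted or plain)
    return tokens

s = 'I am going to visit "Huge Hotel" and the "Grand River"'
print(tokenize_keep_quoted(s))
```

An empty pair of quotes ("") would yield an empty-string token here, and an unmatched quote stays attached to its word, so inputs like that would need extra handling.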