Question

I have a string of the form:

'I am going to visit "Huge Hotel" and the "Grand River"'

I want to tokenize it as

['I', 'am', 'going', ..., 'Huge Hotel', 'and', 'the', 'Grand River']

As you can see, 'Huge Hotel' and 'Grand River' are each treated as a single token because they appear inside double quotes.

import nltk
text = 'I am going to visit "Huge Hotel" and the "Grand River"'
b = nltk.word_tokenize(text)

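I wrote the code above, but it does not give the result I want: the quoted phrases are not kept together as single tokens.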

Answer 1

It looks odd, but it works:

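1. re.findall('"([^"]*)"', s): find all substrings enclosed in double quotes.
2. phrase.replace(' ', '_'): replace every space with an underscore in each substring from step 1.
3. Replace each double-quoted string in the original text with its underscored version from step 2.
4. Run word_tokenize() on the modified string.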

[OUT]:

>>> import re
>>> from nltk import word_tokenize
>>> s = 'I am going to visit "Huge Hotel" and the "Grand River"'
>>> for phrase in re.findall('"([^"]*)"', s):
...     s = s.replace('"{}"'.format(phrase), phrase.replace(' ', '_'))
... 
>>> s
'I am going to visit Huge_Hotel and the Grand_River'
>>> word_tokenize(s)
['I', 'am', 'going', 'to', 'visit', 'Huge_Hotel', 'and', 'the', 'Grand_River']

I'm sure there is a simpler single regex operation that could replace this series of regex + string operations.
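One possible way to collapse the find-and-replace loop into a single pass is re.sub with a replacement function. This is only a minimal sketch: the helper name join_quoted and its joiner parameter are made up for illustration, and it assumes phrases are delimited by plain, non-nested double quotes.

```python
import re

# Matches a double-quoted phrase and captures the text between the quotes.
QUOTED = re.compile(r'"([^"]*)"')


def join_quoted(text, joiner="_"):
    """Drop the quotes around each quoted phrase and join its words with `joiner`.

    Hypothetical helper for illustration; not part of NLTK.
    """
    return QUOTED.sub(lambda m: m.group(1).replace(" ", joiner), text)


s = 'I am going to visit "Huge Hotel" and the "Grand River"'
joined = join_quoted(s)
print(joined)  # I am going to visit Huge_Hotel and the Grand_River
```

The resulting string can then be passed to word_tokenize() exactly as in the session above. If tokens with spaces are wanted, as in the question, the underscores can be mapped back afterwards with something like [t.replace('_', ' ') for t in tokens], assuming the original text contains no underscores of its own.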
