Question

I am trying to remove punctuation from a unicode string that may contain non-ASCII letters. I tried using the regex module:

import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)

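However, I noticed that the characters < and > are not removed. Does anyone know why, and is there another way to strip punctuation from unicode strings?

EDIT: Another approach I have tried is: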

import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")

But I would like to avoid converting the text from unicode to a byte string and back.

Answer 1

< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:

regex.sub('[\p{P}\p{Sm}]+', '', text)
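One way to confirm this classification is to ask the standard unicodedata module for each character's general category. The following is a small sketch written for Python 3 (unlike the Python 2 snippets in this thread); the helper name categories is made up for illustration.

```python
import unicodedata

def categories(text):
    """Return (character, general category) pairs for every character in text."""
    return [(ch, unicodedata.category(ch)) for ch in text]

for ch, cat in categories("<Üäik>"):
    # '<' and '>' report Sm (math symbol); the letters report Lu or Ll.
    print(ch, cat)
```

Any category starting with P is what \p{P} matches, which is why the angle brackets survive the original pattern.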

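The unicode.translate() method exists too. It takes a dictionary mapping integers (code points) to other integer code points, to unicode strings, or to None; None removes that code point. Map string.punctuation to code points with ord():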

text.translate(dict.fromkeys(ord(c) for c in string.punctuation))

This only removes the limited set of ASCII punctuation characters.

Demo:

>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub('[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik
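In Python 3, where every str is already unicode, the same idea can be written with str.maketrans. This is a minimal sketch assuming Python 3; strip_ascii_punct is a hypothetical helper name. It also illustrates the limitation mentioned above: non-ASCII punctuation such as « and » is left untouched.

```python
import string

# Third argument of str.maketrans lists characters to delete.
ASCII_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_ascii_punct(text):
    """Remove only the ASCII characters listed in string.punctuation."""
    return text.translate(ASCII_PUNCT_TABLE)

print(strip_ascii_punct("<Üäik>"))   # ASCII angle brackets are removed
print(strip_ascii_punct("«Üäik»"))   # guillemets are not in string.punctuation
```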

If string.punctuation is not enough, you can generate a complete translate() mapping for all P and Sm code points by iterating from 0 to sys.maxunicode and testing each value against unicodedata.category():

>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik

(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...).)
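For reference, here is how the full-table approach might look as a standalone Python 3 script. It is a sketch under the same assumptions as the answer (remove every code point whose category starts with P or equals Sm); the names REMOVE_TABLE and strip_punct are illustrative, not part of any library.

```python
import sys
import unicodedata

# Build the table once: every code point in a P* category or in Sm maps to None,
# which tells str.translate() to delete it.
REMOVE_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith(("P", "Sm"))
)

def strip_punct(text):
    """Remove all Unicode punctuation and math symbols from text."""
    return text.translate(REMOVE_TABLE)

print(strip_punct("<Üäik>"))
print(strip_punct("«Üäik»"))
```

Building the table scans every code point, so it is worth doing once at import time and reusing it rather than rebuilding it per call.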

Answer 2

Try the string module:

import string,re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)

This prints:

Üäik
<type 'unicode'>

Answer 3

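\p{P} matches punctuation characters.

Among the ASCII characters, those are

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

< and > are not punctuation characters (they are math symbols, like + = | ~), so they are not removed.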

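Try this instead, which adds them to the character class (the regex module is needed, since re does not support \p{...}):

regex.sub(ur'[\p{P}<>]+', u"", text)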
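To see which ASCII characters \p{P} would actually cover, one can split string.punctuation by Unicode general category. This is a Python 3 sketch using only the standard library; the variable names are illustrative.

```python
import string
import unicodedata

# Characters whose general category starts with P are what \p{P} matches.
punct = [c for c in string.punctuation if unicodedata.category(c).startswith("P")]
# The rest of string.punctuation falls into symbol categories (Sm, Sc, Sk).
symbols = [c for c in string.punctuation if unicodedata.category(c).startswith("S")]

print("punctuation:", " ".join(punct))
print("symbols:    ", " ".join(symbols))
```

The second list contains < and >, which is why a pattern built only on \p{P} leaves them in place.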
