Question

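I'm writing a simple application in which I want to replace certain words with other words. I'm running into problems with words that contain an apostrophe, such as aren't, ain't, isn't.

I have a text file with the following contents: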

aren’t=ain’t
hello=hey

I parse the text file and build a dictionary from it:

u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'

Then I try to apply all the replacements to a given text:

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text

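The problem is that re.sub() does not match u"aren't" against u'aren\u2019t'.

What can I do so that my replace_all() function matches both u"aren't" and "aren’t" and replaces them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use the Unicode character, or could I modify the regex to match the Unicode character as well as all the other text?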

Answer 1

I think your problem is that you have:

text = u"aren't"

instead of:

text = u"aren’t"

(Note the different apostrophes?)

Here is your code, modified to make it work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
    }
#text = u"aren't"
text = u"aren’t"


def replace_all(text, d):
    for i, k in d.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print newtext

Output:

ain’t
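If the input text may contain either kind of apostrophe, one option is to normalize both the dictionary and the text before matching. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above; the `normalize` helper and the `APOSTROPHES` table are names made up for this illustration, and it assumes that emitting a plain ASCII apostrophe in the result is acceptable.

```python
import re

# Assumption: only U+2019 needs mapping to the ASCII apostrophe (U+0027).
APOSTROPHES = {ord("\u2019"): "'"}

def normalize(s):
    """Lowercase s and replace U+2019 with a plain ASCII apostrophe."""
    return s.translate(APOSTROPHES).lower()

def replace_all(text, mapping):
    """Replace whole words in text, ignoring the apostrophe style."""
    text = normalize(text)
    for old, new in mapping.items():
        # re.escape keeps any regex metacharacters in the word literal.
        pattern = r"\b" + re.escape(normalize(old)) + r"\b"
        text = re.sub(pattern, normalize(new), text)
    return text

mapping = {"aren\u2019t": "ain\u2019t", "hello": "hey"}
result_ascii = replace_all("They aren't here", mapping)
result_curly = replace_all("They aren\u2019t here, hello", mapping)
print(result_ascii)
print(result_curly)
```

Both inputs are handled the same way because the apostrophes are unified before the regex ever runs; the trade-off is that the original typographic apostrophe is not preserved in the output.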

Answer 2

This works for me in Python 2.6.4:

>>> re.sub(ur'\baren\u2019t\b', 'rep', u'aren\u2019t')
u'rep'

Make sure your pattern string is a Unicode string, otherwise it may not work.
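To also match the plain ASCII apostrophe, the pattern itself could accept either character. This is a rough sketch for Python 3, where `str` patterns are already Unicode; `word_pattern` is a hypothetical helper, not something from the question's code.

```python
import re

def word_pattern(word):
    """Compile a whole-word pattern where an apostrophe matches ' or U+2019."""
    # Split on either apostrophe, escape the pieces, rejoin with a character class.
    parts = re.split("['\u2019]", word)
    body = "['\u2019]".join(re.escape(p) for p in parts)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

pattern = word_pattern("aren\u2019t")
out_ascii = pattern.sub("ain\u2019t", "Aren't they?")
out_curly = pattern.sub("ain\u2019t", "aren\u2019t they?")
print(out_ascii)
print(out_curly)
```

Using `re.IGNORECASE` instead of lowercasing the whole text leaves the rest of the input untouched, which may or may not be what the application wants.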

Answer 3

Try saving the file with UTF-8 encoding.
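Reading the file with an explicit encoding matters just as much as saving it that way. A small sketch, assuming the file holds one `old=new` pair per line as in the question; `load_replacements` is a hypothetical name, and the temporary file only stands in for the real replacements file.

```python
import io
import os
import tempfile

def load_replacements(path):
    """Parse 'old=new' lines from a UTF-8 file into a dict."""
    mapping = {}
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or "=" not in line:
                continue  # skip blank or malformed lines
            old, new = line.split("=", 1)
            mapping[old] = new
    return mapping

# Demo: write the sample contents from the question to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("aren\u2019t=ain\u2019t\nhello=hey\n".encode("utf-8"))
loaded = load_replacements(tmp.name)
os.remove(tmp.name)
print(loaded)
```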

Answer 4

u"aren\u2019t" == u"aren't"

False

u"aren\u2019t" == u"aren’t"

True
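The two characters look alike but are different code points, which a quick check with the standard `unicodedata` module makes visible (Python 3 syntax shown here):

```python
import unicodedata

chars = ("'", "\u2019")
# Look up the official Unicode name of each apostrophe-like character.
names = [unicodedata.name(ch) for ch in chars]
for ch, name in zip(chars, names):
    print("U+%04X %s" % (ord(ch), name))
```

This prints U+0027 APOSTROPHE for the character typed on a keyboard and U+2019 RIGHT SINGLE QUOTATION MARK for the one in the text file.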

Handling Unicode characters with Python regular expressions
