Find-and-replace using a regular expression: inner function does not work

Time: 2018-07-31 17:43:57

Tags: python regex python-3.x dictionary

I'm new to Stack Overflow and hope someone can help me with the code below.

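I am trying to adapt a piece of code from the Python Cookbook by Ascher, Ravenscroft and Martelli. Using the key:value pairs of a dictionary (all text is utf-8), I want to replace every word in Text that contains a "long-s" with the equivalent word spelled with a modern lowercase s. I can build the dictionary from an existing tab-separated file without problems (the code here uses a simple example dictionary to make editing easier), but I want to make all the changes in a single pass for speed and efficiency. I removed the map(re.escape, ...) part of the code because I don't think the "long-s" needs escaping (although I may be wrong!). The first part works fine, but the inner function one_xlat does not seem to do anything. In the end it does not return/print Text, and there is no error message. I have run the code from the command line and in IDLE with the same result. I have run it with and without the escaping step, and renamed the variables just to be sure, but I can't quite get it to work. Can anyone help? Apologies if I have missed something obvious, and many thanks.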

Original code from Ascher, Ravenscroft and Martelli:

import re
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

Adapted version:

import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

def word_replace(text, adictCR):
    regex_dict = re.compile('|'.join(adictCR))
    print(regex_dict)
    def one_xlat(match):
        return adictCR[match.group(0)]
    return regex_dict.sub(one_xlat, text)
    print(text)

word_replace(text, adictCR)

1 answer:

Answer 0 (score: 0)

I would rewrite your code like this:

# -*- coding: utf-8 -*-
import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

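new_s = []
# Walk the text as alternating runs of word and non-word characters.
for g in (m.group(0) for m in re.finditer(r'\w+|\W+', text)):
    # Swap a token only when it is an exact key in the dictionary.
    if g in adictCR:
        g = adictCR[g]
    new_s.append(g)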

You can then use ''.join(new_s) to get the new string.
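The loop and the join can be wrapped into one function. The following is only a minimal sketch of that idea: the names replace_words, mapping, sample_map and sample are made up for illustration and do not come from the question or the Cookbook.

```python
import re

def replace_words(text, mapping):
    """Replace whole tokens of `text` that appear as keys in `mapping`."""
    # Split the text into alternating runs of word and non-word characters.
    tokens = re.findall(r'\w+|\W+', text)
    # Look each token up; tokens that are not keys pass through unchanged.
    return ''.join(mapping.get(tok, tok) for tok in tokens)

# Hypothetical sample data, shortened from the question's dictionary and text.
sample_map = {"oſ": "of", "proſpect": "prospect"}
sample = "an extensive proſpect oſ the town"
print(replace_words(sample, sample_map))
```

One consequence of tokenising on \w+ and \W+: a hyphenated key such as "ſea-side" can never match, because the hyphen splits it into three separate tokens before the lookup happens.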

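Note: with non-ASCII text, the pattern '\w+|\W+' only works in recent versions of Python (3.1+). As an alternative you could use re.split(r'(\W)', text), but I don't think that works with utf-8 text in Python 2.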
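As for why the adapted function in the question appears to do nothing: its return statement comes before print(text), so that print is never reached, and the value returned by word_replace(text, adictCR) is discarded by the caller. Strings are also immutable, so text itself would not change in any case. A hedged sketch of the single-pass regex approach with the result captured is shown here; restoring re.escape and sorting the keys longest-first are my own additions, and sample_map is hypothetical data.

```python
import re

def word_replace(text, mapping):
    # Longest keys first, so a short key cannot win over a longer one
    # that starts at the same position; re.escape keeps '-' or '.' literal.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, keys)))
    # sub() calls the function for every match and returns a NEW string;
    # the original text is left untouched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

# Hypothetical sample data based on the question's dictionary.
sample_map = {"oſ": "of", "proſpect": "prospect", "ſea-side": "sea-side"}
result = word_replace("a proſpect oſ the ſea-side", sample_map)
print(result)  # the caller has to keep or print the returned string
```

Unlike the token-based loop, this pattern also matches keys that occur inside longer words. If only whole words should be replaced, wrapping the alternation in word boundaries, r'\b(?:...)\b', is one option.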
