Question

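I'm new to Stack Overflow and hope someone can help me with the code below.

I'm trying to adapt a piece of code from the Python Cookbook by Ascher, Ravenscroft and Martelli. Using dictionary key:value pairs (all text is UTF-8), I want to replace every word in the text that contains a "long-s" (ſ) with the equivalent word spelled with a modern lowercase s. I can build the dictionary from an existing tab-delimited file without problems (the code below uses a small example dictionary to make editing easier), but I want to make all the changes in a single pass for speed and efficiency. I removed the map and escape parts of the code because I assumed the long-s doesn't need escaping (though I may be wrong!). The first part works fine, but the inner function one_xlat doesn't seem to do anything. In the end, the text is not returned or printed, and there is no error message. I've run the code from the command line and in IDLE, with the same result. I've run it with and without map and escape, and I've renamed the variables to be sure, but I can't quite get it to work. Can anyone help? Apologies if I've missed something obvious, and many thanks.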

The original code from Ascher, Ravenscroft and Martelli:

import re
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

My adapted version:

import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

def word_replace(text, adictCR):
    regex_dict = re.compile('|'.join(adictCR))
    print(regex_dict)
    def one_xlat(match):
        return adictCR[match.group(0)]
    return regex_dict.sub(one_xlat, text)
    print(text)

word_replace(text, adictCR)

Answer 1

I would rewrite your code like this:

# -*- coding: utf-8 -*-
import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

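new_s = []
# Split the text into alternating runs of word and non-word characters.
for g in (m.group(0) for m in re.finditer(r'\w+|\W+', text)):
    # Swap in the modern spelling when the token is a dictionary key.
    if g in adictCR:
        g = adictCR[g]
    new_s.append(g)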

You can then get the new string with ''.join(new_s).
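The loop above can be wrapped in a function that returns the joined string directly. This is only a minimal sketch: the name `replace_words` and the shortened sample data are made up for illustration and are not part of the original answer. It assumes Python 3, where `\w` matches Unicode letters such as ſ.

```python
import re

def replace_words(text, mapping):
    """Replace whole-word tokens of `text` that appear as keys in `mapping`."""
    # \w+ matches a run of word characters and \W+ a run of everything else,
    # so joining the tokens back together preserves spacing and punctuation.
    tokens = (m.group(0) for m in re.finditer(r'\w+|\W+', text))
    return ''.join(mapping.get(tok, tok) for tok in tokens)

# Shortened, hypothetical sample data in the style of the question.
adictCR = {"oſ": "of", "proſpect": "prospect"}
sample = "an extensive proſpect oſ the town of Sallee"
new_text = replace_words(sample, adictCR)
print(new_text)  # an extensive prospect of the town of Sallee
```

One caveat with token-by-token lookup: a key that contains a non-word character, such as "ſea-side", is split into three tokens ("ſea", "-", "side") and therefore never matches as a whole.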

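Note: with non-ASCII text, the pattern '\w+|\W+' only works in recent versions of Python (3.1+). Alternatively, you could use re.split(r'(\W)', text), but I don't think that works with UTF-8 text in Python 2.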
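As for why the adapted function in the question appears to do nothing: the substitution itself does run, but `print(text)` sits after the `return` statement and is never reached, and the value returned by `word_replace(text, adictCR)` is discarded at the call site. Strings are also immutable, so `text` would still hold the original wording even if that print ran. In addition, only "proſpect" and "oſ" from the sample dictionary occur in the sample text, so words such as "caſtle" stay unchanged either way. The sketch below keeps the Cookbook approach and prints the returned value. The `\b` word boundaries, the longest-first sort, and the shortened sample data are my own additions, not part of the original recipe.

```python
import re

def word_replace(text, mapping):
    # Longest keys first, so a short key cannot shadow a longer one that
    # starts at the same position; re.escape keeps keys such as "ſea-side" literal.
    keys = sorted(mapping, key=len, reverse=True)
    # \b restricts matches to whole words (an assumption about the intent).
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b')

    def one_xlat(match):
        # Look up the replacement for whichever key matched.
        return mapping[match.group(0)]

    return pattern.sub(one_xlat, text)

# Shortened, hypothetical sample data in the style of the question.
adictCR = {"oſ": "of", "proſpect": "prospect", "ſea-side": "sea-side"}
text = "an extensive proſpect oſ the town, near the ſea-side"
result = word_replace(text, adictCR)  # keep the return value
print(result)  # an extensive prospect of the town, near the sea-side
```

Unlike the token-based loop, this version also handles hyphenated keys, because the whole key is matched as one pattern.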
