Question

I have a large list of replacements, as shown below.

The replacement file, list.txt:

人の,NN
人の名前,FF

The data to be replaced, text.txt:

aaa人の abc 人の名前def ghi

I want to replace this text using list.txt, as shown below.

>>> my_func('aaa人の abc 人の名前def ghi')
'aaaNN abc FFdef ghi'

Here is my code. I think it is very inefficient for large amounts of data.

import re

d = {}
with open('list.txt', 'r', encoding='utf8') as f:
    for line in f:
        # each line has the form "key,replacement"
        key, value = line.strip().split(',', 1)
        d[key] = value

with open('text.txt', 'r', encoding='utf8') as f:
    txt = f.read()

st = 0
lst = []

# [\u4e00-\u9fea\u3040-\u309f] covers CJK ideographs and hiragana
for match in re.finditer(r"([\u4e00-\u9fea\u3040-\u309f]+)", txt):
    st_m, ed_m = match.span()
    lst.append(txt[st:st_m])

    search = txt[st_m:ed_m]
    rpld = d[search]
    lst.append(rpld)

    st = ed_m

lst.append(txt[st:])

print(''.join(lst))

Please let me know if there is a better way.

Answer 1

Looking at your input aaa人の abc 人の名前def ghi, I see that it contains whitespace in between. So this is not word replacement; it is closer to phrase replacement.

If you need word replacement, see the edit history for the old answer.

Since this is phrase replacement, you can use re (regex) and supply a set of replacements. Here is one implementation:

>>> import re
>>> _regex = {r'aaa人の abc 人の名前def ghi': r'人の,NN 人の名前,FF'}
>>> input_string = 'hi aaa人の abc 人の名前def ghi work'
>>> for pattern in _regex.keys():
...     input_string = re.sub(pattern, _regex[pattern], input_string)
...
>>> input_string
'hi 人の,NN 人の名前,FF work'
>>>
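
For a large list.txt, one option is to stop looping over the patterns and compile all keys into a single alternation, so the text is scanned only once. The sketch below is not part of the original answer: `build_replacer` is a hypothetical helper name, and the mapping is hard-coded here where it would normally be loaded from list.txt. Keys are passed through re.escape so they are matched literally, and sorted longest-first so that 人の名前 is preferred over its prefix 人の.

```python
import re


def build_replacer(mapping):
    """Return a function that applies every replacement in a single pass."""
    # Longest keys first, so '人の名前' wins over its prefix '人の'.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes each key match literally, even if it contains regex metacharacters.
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # The callback looks up the replacement for whichever key matched.
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


# Assumed stand-in for the contents of list.txt from the question.
mapping = {"人の": "NN", "人の名前": "FF"}
my_func = build_replacer(mapping)

result = my_func("aaa人の abc 人の名前def ghi")
print(result)
```

With the sample data from the question this prints aaaNN abc FFdef ghi. The pattern is compiled once, so it can be reused across many input strings.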

Here is an object-oriented implementation of the above:

import csv
import re


class RegexCleanser(object):
    _regex = None

    def __init__(self, input_string: str):
        self._input_string = input_string
        self._regex = self._fetch_rows_as_dict_keys(r'C:\Users\adity\Desktop\japsyn.csv')

    @staticmethod
    def _fetch_rows_as_dict_keys(file_path: str) -> dict:
        """
        Reads the data from the file
        :param file_path: the path of the file that holds the lookup data
        :return: the read data
        """
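        word_map = {}
        try:
            # newline='' is what the csv module expects for file objects
            with open(file_path, encoding='UTF-8', newline='') as csv_file:
                for line in csv.reader(csv_file):
                    word, syn = line
                    word_map[word] = syn
        except FileNotFoundError:
            print(f'Could not find the file at {file_path}')
        # return an empty mapping rather than None so clean() still works
        return word_map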

    def clean(self) -> str:
        for pattern in self._regex.keys():
            self._input_string = re.sub(pattern, self._regex[pattern], self._input_string)
        return self._input_string

Usage:

if __name__ == '__main__':
    cleaner = RegexCleanser(r'hi aaa人の abc 人の名前def ghi I dont know this language.')
    clean_string = cleaner.clean()
    print(clean_string)
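
One thing to watch with a per-pattern loop like clean(): the result depends on the order in which the keys are applied. Assuming the lookup file holds the list.txt data from the question, 人の is applied before 人の名前, so the longer phrase never gets a chance to match. Here is a minimal sketch of the effect, using plain str.replace instead of re.sub and a hypothetical helper `replace_sequentially` that is not part of the class above:

```python
def replace_sequentially(text, mapping, longest_first):
    """Apply each replacement in turn, optionally trying longer keys first."""
    if longest_first:
        keys = sorted(mapping, key=len, reverse=True)
    else:
        # dicts keep insertion order (Python 3.7+), i.e. the order of the file
        keys = list(mapping)
    for key in keys:
        text = text.replace(key, mapping[key])
    return text


# Assumed stand-in for the lookup data, in the same order as list.txt.
mapping = {"人の": "NN", "人の名前": "FF"}
text = "aaa人の abc 人の名前def ghi"

file_order = replace_sequentially(text, mapping, longest_first=False)
sorted_order = replace_sequentially(text, mapping, longest_first=True)
print(file_order)
print(sorted_order)
```

In file order the output is aaaNN abc NN名前def ghi, while longest-first gives the expected aaaNN abc FFdef ghi. Sequential replacement can still rewrite text produced by an earlier replacement, so sorting alone may not be enough if replacement values can contain other keys.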

What is an efficient way to replace strings in Python?
