Replace words with replacement words from another file

Date: 2015-01-05 05:17:58

Tags: python translation nltk

The words in my text file (mytext.txt) need to be replaced with other words supplied in another text file (replace.txt).

cat mytext.txt
this is here. and it should be there. 
me is this will become you is that.

cat replace.txt
this that
here there
me you

The following code does not work as expected.

with open('mytext.txt', 'r') as myf:
    with open('replace.txt' , 'r') as myr:
        for line in myf.readlines():
            for l2 in myr.readlines():
                original, replace = l2.split()
                print line.replace(original, replace)

Expected output:

that is there. and it should be there. 
you is that will become you is that.

6 answers:

Answer 0: (score: 1)

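You print the line after one replacement, then print it again after the next replacement. You want to print the line only after all the replacements have been made.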

  

str.replace(old,new [,count])
  Return a copy of the string ...

You discard that copy every time because you don't save it in a variable. In other words, replace() does not change line.

Next, the word there contains the substring here (which gets replaced by there), so the result ends up as tthere.
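Both pitfalls can be reproduced in a few lines. The following is a minimal sketch in Python 3 syntax (the code in this thread is Python 2); the sample string comes from the question, and the variable names are only illustrative.

```python
import re

line = "this is here. and it should be there."

# str.replace returns a new string; `line` itself is unchanged
# unless the result is assigned back to it.
line.replace("this", "that")
unchanged = line

# Plain substring replacement also rewrites the "here" inside "there".
naive = line.replace("here", "there")

# With \b word boundaries, only the standalone word "here" matches.
bounded = re.sub(r"\bhere\b", "there", line)

print(unchanged)  # this is here. and it should be there.
print(naive)      # this is there. and it should be tthere.
print(bounded)    # this is there. and it should be there.
```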

You can fix these problems like this:

import re

with open('replace.txt' , 'r') as f:
    repl_dict = {}

    for line in f:
        key, val = line.split()
        repl_dict[key] = val


with open('mytext.txt', 'r') as f:
    for line in f:
        for key, val in repl_dict.items():
            # re.escape guards against regex metacharacters in the key
            line = re.sub(r"\b" + re.escape(key) + r"\b", val, line)
        print line.rstrip()

--output:--
that is there. and it should be there. 
you is that will become you is that.

Or, like this:

import re

#Create a dict that returns the key itself
# if the key is not found in the dict:
class ReplacementDict(dict):
    def __missing__(self, key):
        self[key] = key
        return key

#Create a replacement dict:
with open('replace.txt') as f:
    repl_dict = ReplacementDict()

    for line in f:
        key, val = line.split()
        repl_dict[key] = val

#Create the necessary inputs for re.sub():
def repl_func(match_obj):
    return repl_dict[match_obj.group(0)]

pattern = r"""
    \w+   #Match a 'word' character, one or more times
"""

regex = re.compile(pattern, flags=re.X)

#Replace the words in each line with the 
#entries in the replacement dict:
with open('mytext.txt') as f:
    for line in f:
        line = re.sub(regex, repl_func, line)
        print line.rstrip()

With a replace.txt like this:

this that
here there
me you
there dog

...the output is:

that is there. and it should be dog.
you is that will become you is that.
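In this output "here" becomes "there" but does not go on to become "dog", because re.sub scans the line once and does not rescan text it has already substituted. Applying the rules one after another behaves differently. A small Python 3 sketch of the contrast, with the replacement table written inline rather than read from replace.txt (an assumption made to keep the example self-contained):

```python
import re

# Inline stand-in for the four-line replace.txt shown above.
# Relies on dicts keeping insertion order (Python 3.7+).
repl = {"this": "that", "here": "there", "me": "you", "there": "dog"}
line = "this is here. and it should be there."

# One pass: every word is looked up exactly once.
single_pass = re.sub(r"\w+", lambda m: repl.get(m.group(0), m.group(0)), line)

# Sequential passes: a later rule can rewrite the output of an earlier one.
chained = line
for old, new in repl.items():
    chained = re.sub(r"\b" + re.escape(old) + r"\b", new, chained)

print(single_pass)  # that is there. and it should be dog.
print(chained)      # that is dog. and it should be dog.
```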

Answer 1: (score: 1)

The following will solve your problem. The problem with your code is that it prints after every replacement.

The best solution would be:

myr=open("replace.txt")
replacement=dict()
for i in myr.readlines():
    original,replace=i.split()
    replacement[original]=replace
myf=open("mytext.txt")
for i in myf.readlines():
    for j in i.split():
        if(j in replacement.keys()):
            i=i.replace(j,replacement[j])
    print i
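One thing worth checking with this approach: split() separates on whitespace only, so a token such as "here." keeps its full stop and is not found among the dict keys. A short Python 3 sketch of that lookup, using the question's first line and an inline dict in place of replace.txt (both are stand-ins for illustration):

```python
replacement = {"this": "that", "here": "there", "me": "you"}
line = "this is here. and it should be there."

# Whitespace splitting leaves punctuation attached to the words.
tokens = line.split()

# Only tokens that match a key exactly would be replaced.
matched = [t for t in tokens if t in replacement]

print(tokens)   # ['this', 'is', 'here.', 'and', 'it', 'should', 'be', 'there.']
print(matched)  # ['this']
```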

Answer 2: (score: 1)

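It looks like you want your inner loop to read the contents of 'replace.txt' for each line of 'mytext.txt'. That is very inefficient, and it won't actually work as written: once you have read all the lines of 'replace.txt', the file pointer is left at the end of the file, so when you try to process the second line of 'mytext.txt' there are no lines left to read in 'replace.txt'.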

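You could send the myr file pointer back to the start of the file with myr.seek(0), but as I said, that isn't very efficient. A better strategy is to read 'replace.txt' into an appropriate data structure and then use that data to perform the replacements on each line of 'mytext.txt'.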
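The file-pointer behaviour is easy to see in isolation. In this Python 3 sketch an io.StringIO object stands in for the open 'replace.txt' file (an assumption for self-containment; a real file object behaves the same way for readlines() and seek()):

```python
import io

# io.StringIO stands in for the open 'replace.txt' file object.
myr = io.StringIO("this that\nhere there\nme you\n")

first = myr.readlines()   # reads everything; the pointer is now at the end
second = myr.readlines()  # nothing left to read, so this is an empty list
myr.seek(0)               # rewind to the start of the file
third = myr.readlines()   # the full contents again

print(len(first), len(second), len(third))  # 3 0 3
```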

A good data structure to use for this is a dict. For example,

replacements = {'this': 'that', 'here': 'there', 'me': 'you'}

Can you figure out how to build such a dictionary from 'replace.txt'?
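One possible way to build it, sketched in Python 3. The io.StringIO object is a stand-in for open('replace.txt'), and skipping lines that do not contain exactly two words is an extra assumption, not something the question asks for:

```python
import io

# Stand-in for open('replace.txt'); a real file object iterates the same way.
myr = io.StringIO("this that\nhere there\nme you\n")

replacements = {}
for line in myr:
    parts = line.split()
    if len(parts) == 2:          # skip blank or malformed lines
        original, new = parts
        replacements[original] = new

print(replacements)  # {'this': 'that', 'here': 'there', 'me': 'you'}
```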

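I see that gman and 7stud have already covered the issue of saving the replacement results so that they accumulate, so I won't bother discussing that. :)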

Answer 3: (score: 1)

Use re.sub here

>>> import re
>>> with open('mytext.txt') as f1, open('replace.txt') as f2:
...     my_text = f1.read()
...     for x in f2:
...         x=x.strip().split()
...         my_text = re.sub(r"\b%s\b" % x[0],x[1],my_text)
...     print my_text
... 
that is there. and it should be there. 
you is that will become you is that.

\b%s\b wraps the word in word boundaries (\b), so only whole words are matched.
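The pattern interpolates each word into the regex unchanged, which is fine for plain words like those in replace.txt. If a word could contain regex metacharacters, escaping it with re.escape first is a common precaution. A Python 3 sketch with a made-up entry ("a.b" and the sample text are hypothetical, not from the question):

```python
import re

text = "a.b and axb"
word = "a.b"  # hypothetical entry; the '.' is a regex wildcard if left unescaped

# Unescaped: '.' matches any character, so "axb" is replaced too.
unescaped = re.sub(r"\b%s\b" % word, "X", text)

# Escaped: '.' is matched literally, so only "a.b" is replaced.
escaped = re.sub(r"\b%s\b" % re.escape(word), "X", text)

print(unescaped)  # X and X
print(escaped)    # X and axb
```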

Answer 4: (score: 1)

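Edit: I stand corrected, the OP asked for word-by-word replacement rather than simple string replacement ('become' -> 'become' rather than 'becoyou'). I guess a dict version could look like this, using the regex splitting method found in the comments on the accepted answer to "Splitting a string into words and punctuation":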

import re

def clean_split(string_input):
    """ 
    Split a string into its component tokens and return as list
    Treat spaces and punctuations, including in-word apostrophes as separate tokens

    >>> clean_split("it's a good day today!")
    ["it", "'", "s", " ", "a", " ", "good", " ", "day", " ", "today", "!"]
    """
    return re.findall(r"[\w]+|[^\w]", string_input)

with open('replace.txt' , 'r') as myr:
    replacements = dict(tuple(line.split()) for line in myr)

with open('mytext.txt', 'r') as myf:
    for line in myf:
        print ''.join(replacements.get(word, word) for word in clean_split(line)),

I'm no expert on re efficiency; if anyone points out obvious inefficiencies, I'd be very grateful.

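Edit 2: OK, I was inserting spaces between words and punctuation. That is now fixed by treating spaces as tokens and doing a ''.join() instead of a ' '.join().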
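Because whitespace and punctuation are kept as tokens of their own, joining with the empty string reproduces the original spacing exactly. A Python 3 sketch of that round trip, reusing the same regex and the question's second line, with the replacement dict written inline as a stand-in for replace.txt:

```python
import re

def clean_split(string_input):
    # Runs of word characters, or any single non-word character (spaces included).
    return re.findall(r"[\w]+|[^\w]", string_input)

replacements = {"this": "that", "here": "there", "me": "you"}
line = "me is this will become you is that."

tokens = clean_split(line)
round_trip = "".join(tokens)  # identical to the input line
replaced = "".join(replacements.get(tok, tok) for tok in tokens)

print(round_trip == line)  # True
print(replaced)            # you is that will become you is that.
```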

Answer 5: (score: 1)

As an alternative, we can use the string module's Template to achieve this. It works, but it is very ugly and inefficient:

from string import Template

with open('replace.txt', 'r') as myr:
    # read the replacement first and build a dictionary from it
    d = {str(k): v for k,v in [line.strip().split(" ") for line in myr]}

d
{'here': 'there', 'me': 'you', 'this': 'that'}

with open('mytext.txt', 'r') as myf:
    for line in myf:
        print Template('$'+' $'.join(line.strip().replace('$', '_____').\
                  split(' '))).safe_substitute(**d).\
                  replace('$', '').replace('_____', '')

Result:

that is there. and it should be there.
you is that will become you is that.