Question

I am trying to perform some replacements in a file:

'\t' --> '◊'
'⁞' --> '\t'

This question suggests the following approach:

import fileinput

with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        line = line.replace('\t','◊')
        print(line.replace('⁞','\t'), end='')

I am not allowed to comment there, but when I run this code I get the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 10: character maps to <undefined>

I have fixed this kind of error before by adding encoding='utf-8'. The problem is that fileinput.FileInput() does not accept an encoding argument.

Question: how do I get rid of this error?

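The solution above would please me most, provided it works and its speed is comparable to the approach below. It appears to do the replacement in place, as it should.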

I have also tried:

replacements = {'\t':'◊', '⁞':'\t'}
with open(filename, encoding='utf-8') as inFile:
    contents = inFile.read()
with open(filename, mode='w', encoding='utf-8') as outFile:
    for i in replacements.keys():
        contents = contents.replace(i, replacements[i])
    outFile.write(contents)

This is relatively fast, but very greedy in terms of memory, because it reads the whole file at once.

For UNIX users, what I need is the equivalent of the following:

sed -i 's/\t/◊/g' 'file.csv'
sed -i 's/⁞/\t/g' 'file.csv'

This turned out to be slow.

Answer 1

Generally, with FileInput you can specify the encoding by passing fileinput.hook_encoded as the openhook parameter:

import fileinput

with fileinput.FileInput(filename, openhook=fileinput.hook_encoded('utf-8')) as file:
    # ...

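However, this does not work with inplace=True. In that case, you can treat the file as binary and decode/encode the strings yourself. For reading, it is enough to specify mode='rb', which gives you bytes lines instead of str. Writing is a bit more complicated, because print always works with str, or converts its input to str, so passing bytes will not work as expected. However, you can write binary data to sys.stdout directly, which will work: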

import sys
import fileinput

filename = '...'
with fileinput.FileInput(filename, mode='rb', inplace=True, backup='.bak') as file:
    for line in file:
        line = line.decode('utf-8')
        line = line.replace('\t', '◊')
        line = line.replace('⁞', '\t')
        sys.stdout.buffer.write(line.encode('utf-8'))
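
As a complement to the binary-mode approach, here is a minimal sketch of two alternatives. Both function names are hypothetical and not part of the original answer. The first assumes Python 3.10 or later, where FileInput accepts a keyword-only encoding parameter that is also applied to the in-place output file. The second avoids fileinput altogether: it streams the file line by line into a temporary file in the same directory and then swaps it in with os.replace, so memory use is bounded by the longest line rather than by the file size. It relies on str.translate, which is only valid here because every search key is a single character.

```python
import fileinput
import os
import tempfile


def replace_with_fileinput(path, encoding="utf-8"):
    """Hypothetical helper: in-place replacement via fileinput.

    Assumes Python 3.10+, where FileInput gained the `encoding` parameter.
    """
    with fileinput.FileInput(path, inplace=True, backup=".bak",
                             encoding=encoding) as file:
        for line in file:
            line = line.replace("\t", "◊")
            # While the block is active, print() writes to the file.
            print(line.replace("⁞", "\t"), end="")


def replace_chars_streaming(path, mapping, encoding="utf-8"):
    """Hypothetical helper: stream `path` through a temp file, then swap it in.

    `mapping` maps single characters to their replacements. No backup is kept.
    """
    table = str.maketrans(mapping)
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory, so that os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        # newline="" keeps the original line endings untouched.
        with open(fd, "w", encoding=encoding, newline="") as dst, \
             open(path, encoding=encoding, newline="") as src:
            for line in src:
                # All characters are mapped in a single pass per line.
                dst.write(line.translate(table))
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        os.remove(tmp_path)
        raise
```

Usage would look like replace_chars_streaming('file.csv', {'\t': '◊', '⁞': '\t'}). Because str.translate maps all characters simultaneously, the tabs produced from '⁞' are not turned into '◊' again, which matches the result of the two sequential replace calls above.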

In-place (multiple) replacements in a file
