Question

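I'm using Python to automatically generate qsf files for Qualtrics online surveys. The qsf file requires unicode characters to be escaped using the \u+hex convention: 'слово' = '\u0441\u043b\u043e\u0432\u043e'. Currently, I achieve this with the following expression: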

'слово'.encode('ascii','backslashreplace').decode('ascii')

The output is exactly what I need, but since this is a two-step process, I wonder whether there is a more efficient way to get the same result.

Answer 1

If you open the output file in 'wb' mode, it accepts a byte stream rather than unicode arguments:

s = 'слово'
with open('data.txt','wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed

This seems to do what you want:

$ cat data.txt
\u0441\u043b\u043e\u0432\u043e

It avoids the decode step, as well as any translation that happens when writing unicode to a text stream.

Updated to use encode('unicode_escape'), as suggested by @J.F.Sebastian.

%timeit reports that it is noticeably faster than encode('ascii', 'backslashreplace'):

In [18]: f = open('data.txt', 'wb')

In [19]: %timeit f.write(s.encode('unicode_escape'))
The slowest run took 224.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.55 µs per loop

In [20]: %timeit f.write(s.encode('ascii','backslashreplace'))
The slowest run took 9.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.37 µs per loop

In [21]: f.close()

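Curiously, the spread between the slowest and fastest runs is much larger for encode('unicode_escape') than for encode('ascii', 'backslashreplace'), even though its per-loop time is faster, so be sure to test both in your environment.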
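For testing outside IPython, the comparison above can be sketched with the standard timeit module. This is only a rough harness: the helper names (escape_unicode_escape, escape_backslashreplace, best_time) and the loop counts are made up for illustration, and it times the encode calls alone, without the file write.

```python
import timeit

s = 'слово'


def escape_unicode_escape(text):
    # One-step escape via the unicode_escape codec; returns bytes.
    return text.encode('unicode_escape')


def escape_backslashreplace(text):
    # Escape via the ascii codec with the backslashreplace error handler; returns bytes.
    return text.encode('ascii', 'backslashreplace')


def best_time(func, text, number=10_000, repeat=3):
    # Hypothetical helper: best observed time per call, in seconds.
    timings = timeit.repeat(lambda: func(text), number=number, repeat=repeat)
    return min(timings) / number


for func in (escape_unicode_escape, escape_backslashreplace):
    per_call = best_time(func, s)
    print(f'{func.__name__}: {per_call * 1e6:.3f} µs per call')
```

The absolute numbers will differ by machine and Python version, so only the relative ordering on your own setup is meaningful.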

Answer 2

I doubt it is a performance bottleneck in your application, but s.encode('unicode_escape') can be faster than s.encode('ascii', 'backslashreplace').

To avoid calling .encode() manually, you can pass the encoding to open():

with open(filename, 'w', encoding='unicode_escape') as file:
    print(s, file=file)

Note: it also translates non-printable ASCII characters, e.g., a newline as \n, a tab as \t, etc.
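The note above can be checked with a small side-by-side comparison of the two approaches. This is an illustrative sketch: the compare helper and the sample strings are made up for the example and are not part of either answer.

```python
def compare(text):
    # Hypothetical helper: return (unicode_escape result, backslashreplace result) as str.
    via_unicode_escape = text.encode('unicode_escape').decode('ascii')
    via_backslashreplace = text.encode('ascii', 'backslashreplace').decode('ascii')
    return via_unicode_escape, via_backslashreplace


samples = {
    'cyrillic': 'слово',
    'newline and tab': 'a\n\tb',
    'latin-1 letter': 'é',
    'literal backslash': 'a\\b',
}

for label, text in samples.items():
    ue, br = compare(text)
    status = 'same' if ue == br else 'different'
    # repr() makes real control characters and literal backslashes visible.
    print(f'{label}: {status}: unicode_escape={ue!r} backslashreplace={br!r}')
```

In this sketch the two approaches agree on the Cyrillic sample. They diverge on the control characters and on the literal backslash, both of which unicode_escape rewrites while backslashreplace leaves them untouched. Both render 'é' as \xe9 rather than \u00e9, which may matter if the target format insists on the \u form.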

More efficient way to make unicode escape codes