Question

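(Python 3.3.2) I have to unescape some non-ASCII escaped strings returned by a call to re.escape(). I have tried the methods described here and here, but they don't work. I'm working in a 100% UTF-8 environment.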

import codecs

# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print(cod(mystring))

# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)

# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"

Is this a bug? Am I misunderstanding something?

Any help would be appreciated!

PS: I edited my post thanks to Michael Foukarakis' remark.

Answer 1

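You seem to misunderstand encodings. To guard against common errors, we usually encode a string when it leaves the application and decode it when it comes in.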
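As a minimal sketch of that rule (the byte value and the variable names are only illustrative, not taken from the question), decoding happens once at the boundary on the way in and encoding once on the way out:

```python
# Bytes as they might arrive from a UTF-8 file or socket (illustrative value).
incoming = b'\xe2\x82\xac\n'

# Decode once, on the way in: from here on the application only handles str.
text = incoming.decode('utf-8')

# Work with text inside the application.
doubled = text.strip() * 2

# Encode once, on the way out.
outgoing = doubled.encode('utf-8')

print(ascii(text), ascii(doubled), outgoing)
```

Everything between the two boundary calls works on str, so no codec needs to be involved there.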

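First, let's look at the documentation of unicode_escape, which states:

Produce a string that is suitable as Unicode literal in Python source code.

Here is what you would get from the network, or from a file that claims its contents are Unicode-escaped: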

b'\\u20ac\\n'

Now, you have to decode it to use it in your app:

>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
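This may also explain the output in the question. As far as I can tell from the codec documentation, unicode_escape decodes from Latin-1 source, so raw UTF-8 bytes fed into it come out as one character per byte. A small sketch (variable names are mine):

```python
# The three UTF-8 bytes of the euro sign.
euro_utf8 = "€".encode("utf-8")          # b'\xe2\x82\xac'

# unicode_escape treats bytes that are not part of an escape as Latin-1,
# so it gives the same three characters as a plain Latin-1 decode.
as_latin1 = euro_utf8.decode("latin-1")
as_escape = euro_utf8.decode("unicode_escape")

# Input that really is Unicode-escaped decodes as expected.
from_escapes = b"\\u20ac\\n".decode("unicode_escape")

print(ascii(as_latin1), ascii(as_escape), ascii(from_escapes))
```

In other words, the codec is not broken; it was simply given UTF-8 data rather than escaped ASCII.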

And if you wanted to write it back to, say, a Python source file:

with open('/tmp/foo', 'wb') as fh: # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
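To see what ends up in that file, here is a hedged sketch that builds the same bytes in memory. It also shows one caveat: the documentation notes that unicode_escape does not escape quotes, so for arbitrary text the built-in ascii() may be a safer way to produce a literal. The sample strings are my own.

```python
import ast

s = "€\n"

# The bytes the snippet above would write to the file.
escaped = s.encode("unicode_escape")
source = b'print("' + escaped + b'")'

# ascii() yields a complete, quoted, ASCII-only literal, including for text
# that itself contains double quotes.
literal = ascii('say "hi" €\n')

# Evaluating the literal gives the original text back.
round_trip = ast.literal_eval(literal)

print(source, literal)
```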

Answer 2

I guess the actual string you need to process is mystring = "€\\n"?

mystring = "€\n"  # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"

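I don't really understand what goes wrong inside Python 3's encode() and decode(), but my friend solved this problem while we were writing some tools.

What we did was bypass encode("utf_8") once the unescaping step is done.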

>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n'  # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n'  # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'

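We can see that, although the result of decode("unicode_escape") looks weird, its characters carry exactly the bytes of the correct UTF-8 encoding of the string, in this case "\xe2\x82\xac\n".

So we do not print the str object directly, nor do we call encode("utf_8") on it; instead we use ord() to build the bytes object b'\xe2\x82\xac\n'.

You can get the correct str from this bytes object simply by passing it to str() together with the encoding.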
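For what it's worth, the bytes([ord(char) ...]) step appears to be the same as encoding with Latin-1, which maps code points 0 to 255 straight onto byte values. A sketch comparing the two (variable names are mine):

```python
escaped = "€\\n"   # 3 characters: "€", "\" and "n"

# One character per original UTF-8 byte, plus the decoded newline.
mangled = escaped.encode("utf_8").decode("unicode_escape")

# Two ways to turn those characters back into the bytes they stand for.
via_ord = bytes([ord(ch) for ch in mangled])
via_latin1 = mangled.encode("latin-1")

# Decode the recovered bytes as UTF-8 to get the intended text.
result = via_latin1.decode("utf_8")

print(via_ord, via_latin1, ascii(result))
```

Both variants fail if the input also contains an escape above \xff, such as \u20ac, because the decoded character then no longer fits into a single byte.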

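By the way, the tool my friend and I wanted to build is a wrapper that lets users type C-like string literals and converts the escape sequences automatically.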

It is a powerful way for users to enter non-printable characters in a terminal.

Our final tool works like this:

User input:\n\x61\x62\n\x20\x21  # 20 characters, which represent 6 chars semantically
output:  # \n
ab       # \x61\x62\n
 !       # \x20\x21
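The tool itself is not shown here, so the following is only a rough sketch of how such a converter could be written, not the authors' implementation. It uses a different technique from the one above: a regular expression picks out each well-formed escape and decodes only that, leaving non-ASCII characters untouched. The names ESCAPE_RE and unescape_literal are hypothetical.

```python
import codecs
import re

# Matches only well-formed escapes; everything else passes through unchanged.
ESCAPE_RE = re.compile(
    r"""
    \\(?:
        x[0-9A-Fa-f]{2}    # two-digit hex escape
      | u[0-9A-Fa-f]{4}    # four-digit Unicode escape
      | U[0-9A-Fa-f]{8}    # eight-digit Unicode escape
      | [0-7]{1,3}         # octal escape
      | [\\'"abfnrtv]      # single-character escapes
    )
    """,
    re.VERBOSE,
)


def unescape_literal(text):
    """Replace each escape sequence in text with the character it denotes."""
    return ESCAPE_RE.sub(
        lambda match: codecs.decode(match.group(0), "unicode_escape"),
        text,
    )


user_input = "\\n\\x61\\x62\\n\\x20\\x21"   # the 20 characters from the example
converted = unescape_literal(user_input)
print(len(user_input), len(converted))
```

Because each match is pure ASCII, the Latin-1 behaviour of unicode_escape never comes into play, so input that mixes "€" with escapes such as \u20ac is handled too.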

Answer 3

import string
printable = string.printable
printable = printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))

Unfortunately, string.printable only contains ASCII characters. You can make a copy, as I did here, and extend it with whatever Unicode characters you want, such as €.
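If maintaining that list by hand becomes tedious, one possible alternative (my suggestion, not part of the answer above) is str.isprintable(), which knows about Unicode. The function name escape_nonprintable is hypothetical.

```python
def escape_nonprintable(s):
    """Keep printable characters, escape all others with unicode_escape."""
    return "".join(
        # unicode_escape always produces ASCII bytes, so decoding is safe.
        c if c.isprintable() else c.encode("unicode_escape").decode("ascii")
        for c in s
    )


print(escape_nonprintable("€\n"))
```

As with the version above, a literal backslash in the input is left as it is, so the output cannot always be converted back unambiguously.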

Python: escaping non-ASCII characters

3 answers: