Decoding bytes to a Unicode string

Asked: 2013-11-26 06:41:20

Tags: python unicode python-3.x encoding utf-8

The problem is how to extract a string whose bytes are written out as escape sequences inside another string. What I mean:

>>> s1 = '\\xd0\\xb1'  #  But this is NOT bytes of s1! s1 should be 'б'!
>>> s1
'\\xd0\\xb1'
>>> s1[0]
'\\'
>>> len(s1)            #  The problem is here: I thought I would see (2), but:
8
>>> type(s1)
<class 'str'>
>>> type(s1[0])
<class 'str'>
>>> s1[0] == '\\'
True

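So how can I convert s1 to 'б' (the Cyrillic letter that '\xd0\xb1' really represents)? I have already asked a similar question here, but my mistake was misunderstanding the real repr of s1 (I thought it contained '\' where it really contains '\\').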

2 answers:

Answer 0 (score: 3)

>>> s1 = b'\xd0\xb1' 
>>> s1.decode("utf8")
'б'
>>> len(s1)
2

Answer 1 (score: 3)

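Try the following code. Warning: it is only a proof of concept. When the text also contains characters that are not written as escape sequences, the replacement must be done in a more sophisticated way (I will show that later if wanted). See the comments below.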

import binascii

s1 = '\\xd0\\xb1'
print('s1 =', repr(s1), '=', list(s1))            # list() to emphasize what are the characters

s2 = s1.replace('\\x', '')
print('s2 =', repr(s2))

b = binascii.unhexlify(s2)
print('b =', repr(b), '=', list(b))

s3 = b.decode('utf8')
print('s3 =', ascii(s3))

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(s3)

It prints on the console:

c:\__Python\user\so20210201>py a.py
s1 = '\\xd0\\xb1' = ['\\', 'x', 'd', '0', '\\', 'x', 'b', '1']
s2 = 'd0b1'
b = b'\xd0\xb1' = [208, 177]
s3 = '\u0431'

And it writes the character to the output.txt file.
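As a side note, the hex-to-bytes step can also be done with the built-in bytes.fromhex() instead of binascii.unhexlify(). This is only a minimal sketch under the same assumption as above: the string consists of nothing but \xNN escapes.

```python
s1 = '\\xd0\\xb1'                    # 8 characters, not 2 bytes

# Strip the literal backslash-x prefixes, leaving only the hex digits.
hex_digits = s1.replace('\\x', '')   # 'd0b1'

# bytes.fromhex() turns each pair of hex digits into one byte.
b = bytes.fromhex(hex_digits)        # b'\xd0\xb1'

# The two bytes are the UTF-8 encoding of a single character.
s3 = b.decode('utf-8')
print(ascii(s3))                     # '\u0431'
```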

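The problem is that the question combines Unicode escaping with escaped binary values. In other words, a Unicode string can contain a sequence that represents a binary value in some way; however, you cannot force that binary value directly into a Unicode string, because a Unicode character is really an abstract integer (a code point), and that integer can be represented in many ways as a sequence of bytes.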
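To illustrate the point about one abstract integer having several byte representations, here is a small sketch. The choice of encodings is mine, picked only for illustration.

```python
ch = 'б'                               # CYRILLIC SMALL LETTER BE
code_point = ord(ch)                   # the abstract integer, 0x431

# The same code point encoded three different ways.
utf8_bytes = ch.encode('utf-8')        # two bytes:  d0 b1
utf16_bytes = ch.encode('utf-16-le')   # two bytes:  31 04
cp1251_bytes = ch.encode('cp1251')     # one byte:   e1

for name, data in [('utf-8', utf8_bytes),
                   ('utf-16-le', utf16_bytes),
                   ('cp1251', cp1251_bytes)]:
    print(name, data.hex())

# The reverse direction is just as ambiguous: the same two bytes give
# different text depending on which encoding is assumed.
as_utf8 = utf8_bytes.decode('utf-8')       # 'б'
as_latin1 = utf8_bytes.decode('latin-1')   # 'Ð±'
```

That is why the bytes d0 b1 only mean 'б' once you state that they are UTF-8.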

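If the Unicode string contained escape sequences like \\n, it could be done differently, using the 'unicode_escape' codec with bytes.decode(). However, in this case you need two steps: first decode the ASCII escape sequences, then decode the result from UTF-8.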
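A minimal sketch of that two-step decoding, assuming the input contains only ASCII characters (non-ASCII input would raise an error at the encode('ascii') step). The intermediate Latin-1 step is my addition: it is one way to turn code points 0..255 back into single bytes.

```python
s1 = '\\xd0\\xb1'

# Step 1: interpret the ASCII escape sequences. Each \xNN becomes the
# code point U+00NN, so the result is still a str, not bytes.
step1 = s1.encode('ascii').decode('unicode_escape')   # 'Ð±', 2 characters

# Step 2: Latin-1 maps code points 0..255 one-to-one onto bytes, which
# recovers the raw byte values; those bytes are then decoded as UTF-8.
raw = step1.encode('latin-1')                         # b'\xd0\xb1'
result = raw.decode('utf-8')

print(ascii(result))                                  # '\u0431'
```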

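Update: here is a function that converts your string when it also contains other ASCII characters (i.e. not only escape sequences). The function uses a finite automaton. It may look too complex at first, but it is really just verbose.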

def userDecode(s):
    status = 0
    lst = []                       # result as list of bytes as ints
    xx = None                      # variable for one byte escape conversion
    for c in s:                    # unicode character
        print(status, ' c ==', c)  ## just for debugging
        if status == 0:
            if c == '\\':
                status = 1         # escape sequence for a byte starts
            else:
                lst.append(ord(c)) # convert to integer

        elif status == 1:          # x expected
            assert(c == 'x')
            status = 2

        elif status == 2:          # first nibble expected
            xx = c
            status = 3

        elif status == 3:          # second nibble expected
            xx += c
            lst.append(int(xx, 16)) # this is a hex representation of int
            status = 0

    # Construct the bytes from the ordinal values in the list, and decode
    # it as UTF-8 string.
    return bytes(lst).decode('utf-8')


if __name__ == '__main__':

    s = userDecode('\\xd0\\xb1whatever')
    print(ascii(s))    # cannot be displayed on console that does not support unicode

    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(s)

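Also have a look at the generated file. Remove the debug print when you no longer need it. It displays the following on the console: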

c:\__Python\user\so20210201>b.py
0  c == \
1  c == x
2  c == d
3  c == 0
0  c == \
1  c == x
2  c == b
3  c == 1
0  c == w
0  c == h
0  c == a
0  c == t
0  c == e
0  c == v
0  c == e
0  c == r
'\u0431whatever'
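For comparison, the same conversion can be sketched with a regular expression instead of the hand-written automaton. This is an alternative technique, not the function above; the name decode_hex_escapes is hypothetical, and the sketch assumes that all characters outside the escapes have code points below 256.

```python
import re

# Matches a literal backslash, 'x', and exactly two hex digits (as bytes).
HEX_ESCAPE = re.compile(rb'\\x([0-9a-fA-F]{2})')


def decode_hex_escapes(s):
    """Replace each literal \\xNN with the byte NN, then decode as UTF-8.

    Hypothetical helper for illustration only.
    """
    # Latin-1 turns every character below U+0100 into the byte with the
    # same value, like ord(c) does in the automaton version.
    data = s.encode('latin-1')
    # A function replacement is inserted as-is, with no escape processing.
    raw = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)
    return raw.decode('utf-8')


if __name__ == '__main__':
    print(ascii(decode_hex_escapes('\\xd0\\xb1whatever')))   # '\u0431whatever'
```

Unlike the automaton, this version leaves a malformed escape (for example a backslash not followed by x and two hex digits) in the text unchanged instead of failing on the assert.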