Question

When I run the following in Python 2.7.6, I get an exception:

import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)

Output:

base 64 of the bytes:
gAID
'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

So in Python 2.7.6, the bytes represented as gAID are not valid UTF-8.

When I try this code in Java 8 (HotSpot 1.8.0_74):

java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);

I don't get any exception.

How come? Why is the same byte array valid in Java but invalid in Python when decoded as UTF-8?

Answer 1

This is not valid UTF-8. https://en.wikipedia.org/wiki/UTF-8

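Bytes between 0x80 and 0xBF cannot be the first byte of a multi-byte character. They are continuation bytes and can only appear as the second byte or later.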
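
The lead-byte rule can be read straight off the bit patterns. Below is a minimal Python sketch; classify_utf8_byte is a hypothetical helper introduced here for illustration, not part of any library.

```python
def classify_utf8_byte(value):
    """Classify one byte value (0-255) by the role it can play in UTF-8."""
    if value <= 0x7F:
        return "ascii"         # 0xxxxxxx: a complete single-byte character
    if value <= 0xBF:
        return "continuation"  # 10xxxxxx: only valid after a lead byte
    if 0xC2 <= value <= 0xF4:
        return "lead"          # starts a 2-, 3- or 4-byte sequence
    return "invalid"           # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8

# bytearray yields integers in both Python 2 and Python 3
roles = [classify_utf8_byte(b) for b in bytearray(b"\x80\x02\x03")]
print(roles)
```

The first byte of the question's data is classified as a continuation byte, which is why a strict decoder rejects it at position 0.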

Java replaces bytes that cannot be decoded with a replacement character (shown as ?) rather than throwing an exception.

Answer 2

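This is because the String constructor in Java does not throw an exception on invalid input. See the documentation for this constructor:

public String(byte[] bytes, Charset charset)

... This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.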
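
For comparison, Python exposes both behaviours through the error-handler argument of decode. The sketch below is only an analogy to the Java constructor described above: the "replace" handler substitutes U+FFFD for malformed input, while "strict" (Python's default) raises.

```python
some_bytes = b"\x80\x02\x03"

# Lenient decoding: malformed input becomes U+FFFD instead of raising
lenient = some_bytes.decode("utf-8", "replace")
print(repr(lenient))

# Strict decoding (the default handler) raises on the first malformed byte
try:
    some_bytes.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError as e:
    strict_failed = True
    print(e)
```

With "replace", the Python result has the same shape as the Java String: one replacement character followed by the two control characters.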

Answer 3

"So in Python 2.7.6, the bytes represented as gAID are not valid UTF-8."

This is wrong if what you decode are the Base64-encoded bytes.

import base64
some_bytes = b"\x80\x02\x03"
print ("base 64 of the bytes:")
print (base64.b64encode(some_bytes))
# replace some_bytes with its Base64-encoded form (the ASCII text "gAID")
some_bytes = base64.b64encode(some_bytes)
decoded_bytes = [hex(ord(c)) for c in some_bytes]
print ("decoded bytes: ")
print (decoded_bytes)
try:
    print (some_bytes.decode("utf-8"))
except Exception as e:
    print(e)

Output

gAID
['0x67', '0x41', '0x49', '0x44']
gAID

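In Java, you create a String with the UTF-8 charset from the bytes obtained by Base64-decoding gAID, which (as already answered) yields the default replacement character �.

Running the following snippet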

java.util.Base64.Decoder decoder = java.util.Base64.getDecoder();
byte[] bytes = decoder.decode("gAID");
System.out.println("base 64 of the bytes:");
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);

produces the following output

base 64 of the bytes:
x80 x02 x03 
?

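There you can see the same bytes you used in the Python snippet. In Python they lead to 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte, while here they result in ? (which stands for the default replacement character on a non-Unicode console).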
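
The ? can be reproduced in Python by encoding U+FFFD for a charset that lacks it. This is a sketch that assumes the console behaves like an ASCII-only encoder; the actual console encoding on a given system may differ.

```python
replacement = u"\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# An ASCII-only target cannot represent U+FFFD, so "?" is substituted
shown = replacement.encode("ascii", "replace")
print(shown)
```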

The following snippet uses the bytes of the text gAID itself to construct a String with the UTF-8 charset.

byte[] bytes = "gAID".getBytes(StandardCharsets.ISO_8859_1);
for (byte b : bytes) {
    System.out.printf("x%02x ", b);
}
System.out.println();
Charset charset = Charset.forName("UTF8");
String s = new String(bytes, charset);
System.out.println(s);

Output

x67 x41 x49 x44 
gAID
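
The difference between the Base64 text and the bytes it encodes can be checked in Python as well. This is a minimal sketch; is_valid_utf8 is a hypothetical helper introduced here, not a standard library function.

```python
import base64

def is_valid_utf8(data):
    """Return True if data decodes under Python's strict UTF-8 handler."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = b"gAID"                 # the Base64 text: four ASCII bytes
raw = base64.b64decode(text)   # the three bytes that the text encodes

print(is_valid_utf8(text))
print(is_valid_utf8(raw))
```

The Base64 text is plain ASCII and therefore always valid UTF-8; only the decoded bytes, which start with 0x80, are rejected.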

Byte array is a valid UTF-8 encoded String in Java but not in Python
