Question

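I need to convert (possibly large) strings to UTF-8, but I don't want to create a byte array containing the complete encoding. My idea is to use a CharsetEncoder for this, but CharsetEncoder only operates on a CharBuffer, which means that supplementary characters (outside the Unicode range 0x0000 to 0xFFFF) have to be taken into account.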

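Right now I use CharBuffer.wrap(String.substring(start, start + BLOCK_SIZE)), and my ByteBuffer is created with ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar() * BLOCK_SIZE)). However, the CharBuffer will now contain BLOCK_SIZE code points, not code units (chars); I think the actual number of chars is at most twice BLOCK_SIZE. That means my ByteBuffer is a factor of two too small as well.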

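How do I calculate the correct number of bytes for the ByteBuffer? I could simply double it in case every character is a supplementary character, but that seems a bit much. The only other reasonable option seems to be iterating over all code units (chars) or code points, which looks suboptimal at the very least.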

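Any hints on the most efficient way to encode Strings piecemeal? Should I use a buffer, iterate with String.codePointAt(location), or is there an encoding routine that handles code points directly?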

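Additional requirement: invalid character encoding should result in an exception; default substitution or skipping of invalid characters is not acceptable.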

Answer 1

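It seems easier to simply wrap the entire string and then keep reading chars until none remain. There is no need to cut the string into parts; the encoder simply consumes chars until the output buffer is full: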

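import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// input is the String to encode; BUFFER_SIZE should be at least 4,
// enough to hold the longest UTF-8 sequence
final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
final CharBuffer buffer = CharBuffer.wrap(input);
final ByteBuffer encodedBuffer = ByteBuffer.allocate(BUFFER_SIZE);
CoderResult coderResult;

// loop on OVERFLOW rather than on buffer.hasRemaining(): a trailing unpaired
// high surrogate is left unread while endOfInput is false, so a
// hasRemaining() loop would never terminate on such input
do {
    coderResult = encoder.encode(buffer, encodedBuffer, false);
    if (coderResult.isError()) {
        throw new IllegalArgumentException(
                "Invalid code point in input string");
    }
    encodedBuffer.flip();
    // do stuff with encodedBuffer
    encodedBuffer.clear();
} while (coderResult.isOverflow());

// required by encoder: call encode with true to indicate end
coderResult = encoder.encode(buffer, encodedBuffer, true);
if (coderResult.isError()) {
    throw new IllegalArgumentException(
            "Invalid code point in input string");
}
// required by the encoder contract: flush after the final encode
encoder.flush(encodedBuffer);
encodedBuffer.flip();
// do stuff with encodedBuffer
encodedBuffer.clear(); // if still required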
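
For anyone who wants something to paste and run, here is a minimal self-contained sketch of the same idea. The class name ChunkedUtf8 and the methods encodeTo and drain are made up for illustration, and writing to an OutputStream is just one assumption about what "do stuff with encodedBuffer" might mean. It passes true for endOfInput on every call, which should be fine here because the CharBuffer already wraps the complete string, and it uses CoderResult.throwException() so that malformed input surfaces as a CharacterCodingException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ChunkedUtf8 {

    /** Encodes input as UTF-8 and writes it to out in chunks of at most bufferSize bytes. */
    public static void encodeTo(String input, OutputStream out, int bufferSize)
            throws IOException {
        if (bufferSize < 4) {
            // a supplementary character needs 4 bytes; a smaller buffer could never make progress
            throw new IllegalArgumentException("bufferSize must be at least 4");
        }
        // a fresh encoder reports malformed input instead of replacing it
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer chars = CharBuffer.wrap(input);
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);
        CoderResult result;
        do {
            // endOfInput is true throughout: chars already holds the whole string
            result = encoder.encode(chars, bytes, true);
            if (result.isError()) {
                result.throwException(); // CharacterCodingException, a kind of IOException
            }
            drain(bytes, out);
        } while (result.isOverflow());
        do {
            // the encoder contract asks for a flush after the final encode call
            result = encoder.flush(bytes);
            drain(bytes, out);
        } while (result.isOverflow());
    }

    /** Writes the bytes encoded so far and makes the buffer writable again. */
    private static void drain(ByteBuffer bytes, OutputStream out) throws IOException {
        bytes.flip();
        out.write(bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining());
        bytes.clear();
    }

    public static void main(String[] args) throws IOException {
        String input = "a\u00e9\u20ac\uD83D\uDE00"; // 1-, 2-, 3- and 4-byte sequences
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encodeTo(input, out, 4);
        System.out.println(Arrays.equals(out.toByteArray(),
                input.getBytes(StandardCharsets.UTF_8)));
        try {
            encodeTo("abc\uD83D", new ByteArrayOutputStream(), 4); // unpaired high surrogate
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

On the sizing question: with this approach the output buffer does not have to fit a worst-case block, since the encoder stops with OVERFLOW when the next character does not fit and picks up where it left off on the next call. Also note that String.substring and maxBytesPerChar() both count chars (UTF-16 code units), not code points, so a buffer sized as maxBytesPerChar() * BLOCK_SIZE should already be large enough for a BLOCK_SIZE-char substring.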

Converting a String to UTF-8 using buffers
