Question

I have started using the ICU library in C++.

UnicodeString ucs = UnicodeString::fromUTF8(StringPiece(u8"\U0001F674"));
ucs = ucs.unescape();
size_t len = ucs.length();

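However, len = 2. Why? I only added a single 4-byte character (https://unicode-table.com/en/1F674/). Is there a way to get the correct length?

I expected the length to be 1, because there is only one code point. If I use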

UnicodeString ucs = UnicodeString::fromUTF8(StringPiece(u8"\u06b5"));
ucs = ucs.unescape();
size_t len = ucs.length();

I get the correct result, len = 1.

Answer 1

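UnicodeString stores text as UTF-16, not UTF-8, and length() returns the number of 16-bit code units, not the number of code points.

In UTF-16, the code point U+1F674 lies outside the Basic Multilingual Plane, so it requires a surrogate pair of two 2-byte code units: 0xD83D 0xDE74. The code point U+06B5 needs only one 2-byte code unit: 0x06B5.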
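
To see where those two code units come from, here is a minimal sketch of the UTF-16 encoding step using only the standard library. The helper name toUtf16 is made up for illustration and is not part of ICU; it just applies the surrogate-pair arithmetic defined for UTF-16.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an ICU function): encode one Unicode code point
// as UTF-16 code units.
std::vector<char16_t> toUtf16(char32_t cp) {
    // Precondition: a valid scalar value (in range, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit is enough.
        return {static_cast<char16_t>(cp)};
    }

    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    const char32_t v = cp - 0x10000;
    return {
        static_cast<char16_t>(0xD800 + (v >> 10)),    // lead (high) surrogate
        static_cast<char16_t>(0xDC00 + (v & 0x3FF)),  // trail (low) surrogate
    };
}
```

Calling toUtf16(0x1F674) yields the two units 0xD83D and 0xDE74, while toUtf16(0x06B5) yields the single unit 0x06B5, which matches the lengths 2 and 1 seen in the question.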

Answer 2

To answer the original question: to get the number of code points in a UnicodeString, use UnicodeString::countChar32.

- Shane (from the ICU team)
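
With the snippet from the question, that would be ucs.countChar32() in place of ucs.length(). To illustrate what "counting code points" means on UTF-16 data, here is a rough standard-library sketch. The function countCodePoints is a hypothetical stand-in written for this example, not ICU's implementation; it assumes that a lead surrogate followed by a trail surrogate counts as one code point and that any other unit (including an unpaired surrogate) counts as one on its own.

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration (not ICU code): count code points in a UTF-16
// string by treating each well-formed surrogate pair as a single code point.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool isLead = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool nextIsTrail = i + 1 < s.size() &&
                                 s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (isLead && nextIsTrail) {
            ++i;  // skip the trail unit; the pair is one code point
        }
        ++count;
    }
    return count;
}
```

For example, countCodePoints(u"\U0001F674") gives 1 even though the string holds two code units, and countCodePoints(u"\u06B5") also gives 1.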