Question

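I receive a byte-stream buffer from a TCP server, and it may contain multi-byte sequences that form Unicode characters. Is there always a way to check for a BOM to detect these characters, or how else would you do it?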

Answer 1

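If you know the data is UTF-8, you only need to check the high bit:

0xxxxxxx = single-byte ASCII character
1xxxxxxx = part of a multi-byte character

Or, if you need to distinguish lead bytes from trail bytes:

10xxxxxx = 2nd, 3rd, or 4th byte of a multi-byte character
110xxxxx = first byte of a 2-byte character
1110xxxx = first byte of a 3-byte character
11110xxx = first byte of a 4-byte character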
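The bit patterns above can be turned into a small classifier. This is only a sketch: the name utf8_sequence_length is made up for illustration, and it looks at bit patterns alone, so it does not reject overlong or otherwise ill-formed sequences.

```cpp
// Hypothetical helper (not from the original answer): classify one byte by its high bits.
// Returns the total length in bytes of the sequence that a lead byte starts (1 to 4),
// or 0 if the byte is a trail byte (10xxxxxx) or matches none of the lead patterns.
inline int utf8_sequence_length(unsigned char c) {
    if ((c & 0x80) == 0x00) return 1; // 0xxxxxxx: single-byte ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx: lead byte of a 2-byte character
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx: lead byte of a 3-byte character
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx: lead byte of a 4-byte character
    return 0;                         // 10xxxxxx (trail byte) or 11111xxx (never valid)
}
```

With a TCP buffer, a return value larger than the number of bytes remaining suggests the character was split across reads and the tail should be kept until more data arrives.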

Answer 2

There are many ways to detect multi-byte characters, and unfortunately... none of them are reliable.

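If this is the response to a web request, check the headers, because the Content-Type header usually indicates the page encoding (which can tell you that multi-byte characters may be present).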
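As a rough illustration of the header check, the sketch below pulls the charset parameter out of a Content-Type value. The function name charset_from_content_type is hypothetical, and the parsing is deliberately naive: it assumes the form "type/subtype; charset=value" and is not a full HTTP header parser.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the charset parameter of a Content-Type value
// in lowercase, or an empty string if no charset parameter is present.
inline std::string charset_from_content_type(const std::string& value) {
    // Lowercase a copy so the search is case-insensitive.
    std::string lower;
    for (unsigned char ch : value) {
        lower += static_cast<char>(std::tolower(ch));
    }
    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    // The value runs until the next separator or the end of the string.
    std::string::size_type end = lower.find_first_of("; \t", pos);
    std::string cs = lower.substr(pos, end == std::string::npos ? std::string::npos : end - pos);
    // Strip optional surrounding double quotes.
    if (cs.size() >= 2 && cs.front() == '"' && cs.back() == '"') {
        cs = cs.substr(1, cs.size() - 2);
    }
    return cs;
}
```

A result such as "utf-8" or "shift_jis" only says that multi-byte characters are possible, not that the body actually contains any.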

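You can also check for a BOM (byte order mark). BOMs are invalid characters that should not appear in normal text anyway, so it does no harm to see whether one is there. However, they are optional and are often absent (depending on implementation, configuration, and so on).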
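A BOM check on the start of a buffer might look like the sketch below. The enum Bom and the function detect_bom are names invented for this example. It only covers UTF-8 and UTF-16; note that a UTF-32LE BOM (FF FE 00 00) also begins with the UTF-16LE pattern, so this sketch would report it as UTF-16LE.

```cpp
#include <cstddef>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16BE, Utf16LE };

// Hypothetical helper: inspects the first bytes of a buffer for a BOM.
// Bom::None means "no BOM found", not "no multi-byte characters".
inline Bom detect_bom(const unsigned char* buf, std::size_t len) {
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) return Bom::Utf8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return Bom::Utf16BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return Bom::Utf16LE;
    return Bom::None;
}
```

Because the BOM is optional, Bom::None still leaves the encoding unknown, so this is best used as one hint among several.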

Answer 3

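BOMs are mostly optional. If the server you are receiving from sends multi-byte characters, it may assume you already know that and save the bytes a BOM would take (2 in UTF-16, 3 in UTF-8). Are you looking for a way to tell whether the data you received might be a multi-byte string?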

Answer 4

In UTF-8, any byte with the high (8th) bit set is part of a multi-byte code point. So checking (0x80 & c)!=0 for each byte is the simplest way.
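Applied to a whole buffer, that test could be sketched as follows. The function name contains_multibyte is made up here, and the sketch assumes the data is already known to be UTF-8; for other encodings a set high bit means something different.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if any byte in the buffer has its high bit set,
// which in UTF-8 means the byte belongs to a multi-byte code point.
// The length is passed explicitly because a TCP buffer need not be NUL-terminated.
inline bool contains_multibyte(const char* buf, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        // Convert to unsigned char first: plain char may be signed.
        if ((static_cast<unsigned char>(buf[i]) & 0x80) != 0) {
            return true;
        }
    }
    return false;
}
```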

Answer 5

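Let me implement dan04's answer.

From here on I use C++14. If you can only use an older version of C++, you have to rewrite the binary literals (e.g. 0b10) as ordinary integer literals (e.g. 2).

Implementation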

int is_utf8_character(unsigned char c) { //casts to `unsigned char` to force logical shifts
    if ((c >> 7) == 0b1) {
        if ((c >> 6) == 0b10) {
            return 2; //2nd, 3rd or 4th byte of a utf-8 character
        } else {
            return 1; //1st byte of a utf-8 character
        }
    } else {
        return 0; //a single-byte (ASCII) character, not part of a multi-byte sequence
    }
}

Example

Code

#include <iostream>
#include <string>
using namespace std;

namespace N {

    int is_utf8_character(unsigned char c) { //casts to `unsigned char` to force logical shifts
        if ((c >> 7) == 0b1) {
            if ((c >> 6) == 0b10) {
                return 2; //2nd, 3rd or 4th byte of a utf-8 character
            } else {
                return 1; //1st byte of a utf-8 character
            }
        } else {
            return 0; //a single-byte (ASCII) character, not part of a multi-byte sequence
        }
    }

    unsigned get_string_length(const string &s) {
        unsigned width = 0;
        for (int i = 0; i < s.size(); ++i) {
            if (is_utf8_character(s[i]) != 2) {
                ++width;
            }
        }
        return width;
    }

    unsigned get_string_display_width(const string &s) {
        unsigned width = 0;
        for (int i = 0; i < s.size(); ++i) {
            if (is_utf8_character(s[i]) == 0) {
                width += 1;
            } else if (is_utf8_character(s[i]) == 1) {
                width += 2; //Assumption: each multi-byte character occupies two columns, twice the width of a single-byte character.
            }
        }
        return width;
    }

}

int main() {

    const string s = "こんにちはhello"; //"hello" is "こんにちは" in Japanese.

    for (int i = 0; i < s.size(); ++i) {
        cout << N::is_utf8_character(s[i]) << " ";
    }
    cout << "\n\n";

    cout << "       Length: " << N::get_string_length(s) << "\n";
    cout << "Display Width: " << N::get_string_display_width(s) << "\n";

}

Output

1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 0 0 0 0 0 

       Length: 10
Display Width: 15
