Question

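I need to add the ability to read a file to a C++ program. I found that it does not work with European special characters. The example I am using is Swedish characters.

I changed the code to use wide characters, but that did not seem to help.

The sample text file I am reading contains the following: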

"NEW-DATA"="Nysted Vi prøver lige igen"

This is on Windows, and Notepad says the file uses UTF-8 encoding.

In Visual Studio, when debugging, the string that was read is displayed as if it were ASCII:

ï»¿"NEW-DATA"="Nysted Vi prÃ¸ver lige igen"

I changed the code to use the "wide" methods:

    std::wifstream infile;
    infile.open(argv[3], std::wifstream::in);
    if (infile.is_open())
    {
        std::wstring line;
        while (std::getline(infile, line))
        {

...

Is there anything else I need to do to make it recognize UTF-8 correctly?

Answer 1

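You can read the UTF-8 content as plain narrow (char) text, but you must then convert it to wide characters so that Visual Studio interprets it as Unicode.

Here is the stock function we use for this: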

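#include <windows.h>
#include <oleauto.h>

BSTR UTF8ToBSTR(char const* astr)
{
   // See the function description on MSDN.
   // CP_UTF8 indicates that the input is a UTF-8 string.

   // Get the size of the output needed for the conversion
   // (in wchar_t units, including the terminating null).
   int size = MultiByteToWideChar(CP_UTF8, 0, astr, -1, NULL, 0);
   if (size <= 0)
      return NULL;   // Conversion failed.

   // Allocate the BSTR up front (size - 1 characters plus the terminator)
   // so that a long input cannot overflow a fixed-size buffer.
   BSTR bstr = SysAllocStringLen(NULL, size - 1);
   if (bstr == NULL)
      return NULL;   // Out of memory.

   // Do the conversion directly into the BSTR and return it.
   MultiByteToWideChar(CP_UTF8, 0, astr, -1, bstr, size);
   return bstr;
}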

You must add code to free the BSTR that the function allocates.

For example:

BSTR bstr = UTF8ToBSTR(...);

// Use bstr
// ...


// Deallocate memory
SysFreeString(bstr);

Answer 2

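What is happening is that you have a UTF-8 encoded file, but you are trying to read it as if it were made up of wide characters. That will not work. As you can see, the BOM (byte order mark) has been read verbatim into your string, so clearly the mechanism you are using contains no logic that parses and decodes the UTF-8 byte sequences.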
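The BOM mentioned above is the three-byte sequence EF BB BF, which shows up as ï»¿ when each byte is displayed as a single character. As a minimal sketch, assuming the file is read through a narrow std::ifstream into std::string, the mark could be removed from the first line like this. The helper name stripUtf8Bom is hypothetical and not part of the original answers:

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 byte order mark
// (the bytes EF BB BF) from a line, if one is present.
// Returns true when a BOM was found and removed.
bool stripUtf8Bom(std::string& line)
{
    static const char bom[] = "\xEF\xBB\xBF";
    if (line.compare(0, 3, bom) == 0) {
        line.erase(0, 3);
        return true;
    }
    return false;
}
```

This only deals with the BOM. The remaining bytes are still UTF-8 and still need to be decoded before they can be treated as wide characters.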

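Wide characters and UTF-8 are two fundamentally different things. You can't just read UTF-8 into a wchar_t (or std::wstring) and be done with it. You need to use some sort of Unicode library. There is std::wstring_convert in C++11 (but it needs toolchain support), and there is the manual mbstowcs()/wcstombs() route. All around, it is best to use a library.

Source: https://www.reddit.com/r/cpp/comments/108o7g/reading_utf8_encoded_text_files_to_stdwstring/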

I think mbstowcs()/wcstombs() are the portable alternatives to Microsoft's MultiByteToWideChar() and WideCharToMultiByte().
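To make the std::wstring_convert route concrete, here is a rough sketch that reads the file as raw bytes and decodes each line afterwards. The function names utf8ToWide and readUtf8Lines are made up for illustration. Two caveats: std::wstring_convert and <codecvt> are deprecated since C++17, so a compiler may warn about them, and on Windows wchar_t is 16 bits, so std::codecvt_utf8<wchar_t> only covers characters in the Basic Multilingual Plane (std::codecvt_utf8_utf16<wchar_t> would be needed beyond that).

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: decodes one UTF-8 string into a wide string.
// std::consume_header makes the facet skip a leading UTF-8 BOM.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::wstring utf8ToWide(const std::string& utf8)
{
    std::wstring_convert<
        std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical helper: reads a UTF-8 file through a narrow stream, so the
// bytes arrive untouched, and decodes each line after it has been read.
std::vector<std::wstring> readUtf8Lines(const std::string& path)
{
    std::vector<std::wstring> lines;
    std::ifstream infile(path, std::ios::binary);
    std::string line;
    while (std::getline(infile, line)) {
        // Drop the '\r' left over from Windows line endings.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        lines.push_back(utf8ToWide(line));
    }
    return lines;
}
```

In the question's code, this would replace the std::wifstream loop: call readUtf8Lines(argv[3]) and work with the returned std::wstring values.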

C++ UTF-8 Swedish characters read as ASCII
