Converting emoji to a specific string representation

Time: 2018-07-12 16:06:42

Tags: c# .net unicode

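Currently I am using a HashSet of Tuples called Emoji to replace emoji with a string representation, so that, for example, the bomb emoji becomes U0001F4A3. The conversion is done via

Emoji.Aggregate(input, (current, pair) => current.Replace(pair.Item1, pair.Item2));

which works as expected.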
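For context, the table-driven approach described above might look like the following minimal sketch. The two table entries and the names `EmojiTableDemo` and `ReplaceFromTable` are assumptions made for illustration; the question's real set holds 2600+ pairs and is not shown in the post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class EmojiTableDemo
{
    // Hypothetical two-entry table standing in for the real 2600+ pairs.
    static readonly HashSet<Tuple<string, string>> Emoji = new HashSet<Tuple<string, string>>
    {
        Tuple.Create("\U0001F4A3", "U0001F4A3"), // bomb
        Tuple.Create("\U0001F600", "U0001F600"), // grinning face
    };

    public static string ReplaceFromTable(string input)
    {
        // Every pair in the set triggers one full Replace pass over the string.
        return Emoji.Aggregate(input, (current, pair) => current.Replace(pair.Item1, pair.Item2));
    }

    static void Main()
    {
        Console.WriteLine(ReplaceFromTable("bomb (\U0001F4A3)")); // bomb (U0001F4A3)
    }
}
```

Because each entry costs a separate pass over the input, the work grows with the size of the table, which is one more reason to look for a table-free approach.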

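However, I am trying to achieve the same result without a predefined list of 2600+ items. Has anyone already implemented something that replaces the emoji in a string with their code point counterpart, without the leading \?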

For example:

"This string contains the unicode character bomb (💣)"

becomes

"This string contains the unicode character bomb (U0001F4A3)"

2 answers:

Answer 0: (score: 2)

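It sounds like you're happy to replace any character that isn't in the Basic Multilingual Plane (BMP) with its hex representation. The code to do that is a little long-winded, but it's simple: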

using System;
using System.Text;

class Test
{
    static void Main()
    {
        string text = "This string contains the unicode character bomb (\U0001F4A3)";
        Console.WriteLine(ReplaceNonBmpWithHex(text));
    }

    static string ReplaceNonBmpWithHex(string input)
    {
        // TODO: If most strings don't have any non-BMP characters, consider
        // an optimization of checking for high/low surrogate characters first,
        // and return input if there aren't any.
        StringBuilder builder = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            // A surrogate pair is a high surrogate followed by a low surrogate
            if (char.IsHighSurrogate(c))
            {
                if (i == input.Length - 1)
                {
                    throw new ArgumentException("High surrogate at end of string");
                }
                // Fetch the low surrogate, advancing our counter
                i++;
                char d = input[i];
                if (!char.IsLowSurrogate(d))
                {
                    throw new ArgumentException($"Unmatched high surrogate at index {i-1}");
                }
                uint highTranslated = (uint) ((c - 0xd800) * 0x400);
                uint lowTranslated = (uint) (d - 0xdc00);
                uint utf32 = (uint) (highTranslated + lowTranslated + 0x10000);
                builder.AppendFormat("U{0:X8}", utf32);
            }
            // We should never see a low surrogate on its own
            else if (char.IsLowSurrogate(c))
            {
                throw new ArgumentException($"Unmatched low surrogate at index {i}");
            }
            // Most common case: BMP character; just append it.
            else
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}

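Note that this does *not* attempt to handle the situation where multiple characters are used together, as described in Yury's answer. It replaces each modifier/emoji/secondary character with a separate UXXXXXXXX part.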
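To make that caveat concrete, here is a shorter variant of the same loop, offered as a sketch rather than a replacement. It uses the framework helpers `char.IsSurrogatePair` and `char.ConvertToUtf32` instead of the manual arithmetic. One deliberate difference, which is an assumption of this sketch: lone surrogates are copied through unchanged instead of throwing. The class name `NonBmpDemo` is hypothetical.

```csharp
using System;
using System.Text;

static class NonBmpDemo
{
    public static string ReplaceNonBmpWithHex(string input)
    {
        var builder = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            // True only for a high surrogate immediately followed by a low surrogate.
            if (char.IsSurrogatePair(input, i))
            {
                int codePoint = char.ConvertToUtf32(input, i);
                builder.AppendFormat("U{0:X8}", codePoint);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character, or a lone surrogate (copied as-is in this sketch).
                builder.Append(input[i]);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        // Man + zero-width joiner + graduation cap: one visible emoji, three code points.
        string student = "\U0001F468\u200D\U0001F393";
        string result = ReplaceNonBmpWithHex(student);
        // The joiner U+200D lies in the BMP, so it stays in the output untouched.
        Console.WriteLine(result == "U0001F468\u200DU0001F393"); // True
    }
}
```

The single on-screen emoji turns into two separate UXXXXXXXX parts with an invisible joiner left between them, which is exactly the limitation noted above.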

Answer 1: (score: 0)

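I'm afraid you are making a wrong assumption here. An emoji is not just a "special Unicode character". A single emoji can actually be 4 or more characters (code points) in a row. For example: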

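  • The emoji itself
  • A zero-width joiner
  • A secondary character (such as a graduation cap or a microphone)
  • A gender modifier (man or woman)
  • A skin tone modifier (Fitzpatrick scale)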

So you should definitely take variable length into account.
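One way to keep such variable-length sequences together is sketched below. It is a simplified illustration, not a full implementation of Unicode grapheme segmentation. It assumes that a sequence starts at any non-BMP code point, and that it continues through skin tone modifiers (U+1F3FB to U+1F3FF), the variation selector U+FE0F, and a zero-width joiner (U+200D) together with the code point after it. All names (`EmojiSequenceDemo`, `FindEmojiSequences`, `ReadCodePoint`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class EmojiSequenceDemo
{
    const int ZeroWidthJoiner = 0x200D;
    const int VariationSelector16 = 0xFE0F;

    static bool IsSkinToneModifier(int cp) => cp >= 0x1F3FB && cp <= 0x1F3FF;

    // Reads the code point at index and reports how many UTF-16 chars it occupies.
    static int ReadCodePoint(string s, int index, out int length)
    {
        if (char.IsSurrogatePair(s, index))
        {
            length = 2;
            return char.ConvertToUtf32(s, index);
        }
        length = 1;
        return s[index]; // BMP char (or lone surrogate, returned as-is)
    }

    // Returns each emoji sequence found in input as one string.
    public static List<string> FindEmojiSequences(string input)
    {
        var sequences = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            if (!char.IsSurrogatePair(input, i))
            {
                i++; // not a sequence start under this sketch's assumption
                continue;
            }
            int start = i;
            i += 2; // consume the base code point
            while (i < input.Length)
            {
                int cp = ReadCodePoint(input, i, out int length);
                if (IsSkinToneModifier(cp) || cp == VariationSelector16)
                {
                    i += length;
                }
                else if (cp == ZeroWidthJoiner && i + 1 < input.Length)
                {
                    // The joiner glues the following code point onto the sequence.
                    ReadCodePoint(input, i + 1, out int nextLength);
                    i += 1 + nextLength;
                }
                else
                {
                    break;
                }
            }
            sequences.Add(input.Substring(start, i - start));
        }
        return sequences;
    }

    static void Main()
    {
        // Man + medium skin tone + zero-width joiner + graduation cap.
        string student = "\U0001F468\U0001F3FD\u200D\U0001F393";
        foreach (string sequence in FindEmojiSequences("x" + student + "y"))
        {
            Console.WriteLine(sequence.Length); // 7
        }
    }
}
```

Each returned string can then be converted as a unit, for example by formatting all of its code points into one token, instead of being split into unrelated parts.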

Example: