Question

I'm doing some OCR work in C# and have extracted the text I need to work with. Now I need to parse a line using regular expressions.

string checkNum;
string routingNum;
string accountNum;
Regex regEx = new Regex(@"\u9288\d+\u9288");
Match match = regEx.Match(numbers);
if (match.Success)
    checkNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\u9286\d{9}\u9286");
match = regEx.Match(numbers);
if(match.Success)
    routingNum = match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);
regEx = new Regex(@"\d{10}\u9288");
match = regEx.Match(numbers);
if (match.Success)
    accountNum = match.Value.Remove(match.Value.Length - 1, 1);

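The problem is that when I call .ToCharArray() and inspect the contents of the string, it does contain the Unicode characters I expect, but when I run the regular expressions against it, they never seem to recognize those Unicode characters. I thought strings in C# were Unicode by default.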

Answer 1

I figured it out. I was using the decimal values instead of the hex codes. In other words, I was using \u9288 and \u9286 instead of \u2448 and \u2446 (see http://www.ssec.wisc.edu/~tomw/java/unicode.html#x2440).

Thanks, everyone, for pointing me in the right direction.
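For anyone who hits the same thing: the \uXXXX escape takes exactly four hexadecimal digits, so a decimal value such as 9288 has to be converted to hex before it goes into the pattern. Here is a minimal sketch of that conversion. The class and method names are made up for illustration and are not part of the original code.

```csharp
using System;

public static class EscapeDemo
{
    // Formats a decimal code point (Basic Multilingual Plane only)
    // as a \uXXXX escape suitable for a C# string or a regex pattern.
    public static string ToUnicodeEscape(int codePoint)
    {
        return @"\u" + codePoint.ToString("X4");
    }

    public static void Main()
    {
        Console.WriteLine(ToUnicodeEscape(9288));   // \u2448
        Console.WriteLine(ToUnicodeEscape(9286));   // \u2446
        Console.WriteLine((char)9288 == '\u2448');  // True
    }
}
```

So \u9288 in the original patterns was read as hex 9288, which is a different character from the one the OCR output contains.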

Answer 2

This line:

match.Value.Remove(0, 1).Remove(match.Value.Length - 1, 1);

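throws an ArgumentOutOfRangeException, because the string returned by the first Remove is one character shorter than the original, so match.Value.Length - 1 is no longer a valid start index for the second Remove.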
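A small sketch that reproduces the failure, assuming a matched value of U+2448, four digits, U+2448 (the digits are invented). The StripDelimiters helper is a hypothetical alternative that removes both ends in one Substring call.

```csharp
using System;

public static class RemoveDemo
{
    // Strips the first and last character in one step.
    // The input must be at least two characters long.
    public static string StripDelimiters(string value)
    {
        return value.Substring(1, value.Length - 2);
    }

    public static void Main()
    {
        string value = "\u2448" + "1234" + "\u2448"; // length 6

        try
        {
            // After Remove(0, 1) the string has length 5,
            // so start index 5 with count 1 is out of range.
            string broken = value.Remove(0, 1).Remove(value.Length - 1, 1);
            Console.WriteLine(broken);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("Remove threw ArgumentOutOfRangeException");
        }

        Console.WriteLine(StripDelimiters(value)); // 1234
    }
}
```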

I suggest using groups to extract the values instead. For example:

Regex regEx = new Regex(@"\u2448(\d+)\u2448"); // hex escape for U+2448 (decimal 9288)
Match match = regEx.Match(numbers);
if (match.Success)
    checkNum = match.Groups[1].Value;

With this, I was able to extract the values correctly.
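Putting the pieces together, here is a rough sketch of how all three fields might be pulled out with capture groups. The Extract helper, the MicrParser class name, and the sample line are assumptions for illustration. The field layout (check number between two U+2448 symbols, 9-digit routing number between two U+2446 symbols, 10-digit account number followed by U+2448) is taken from the patterns in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MicrParser
{
    // Returns the first capture group of the pattern,
    // or an empty string when the pattern does not match.
    public static string Extract(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical OCR output; the digits are invented.
        string numbers = "\u2448" + "001234" + "\u2448 "
                       + "\u2446" + "123456789" + "\u2446 "
                       + "0123456789" + "\u2448";

        string checkNum = Extract(numbers, @"\u2448(\d+)\u2448");
        string routingNum = Extract(numbers, @"\u2446(\d{9})\u2446");
        string accountNum = Extract(numbers, @"(\d{10})\u2448");

        Console.WriteLine(checkNum);   // 001234
        Console.WriteLine(routingNum); // 123456789
        Console.WriteLine(accountNum); // 0123456789
    }
}
```

Because the group holds only the digits, there is no need to trim the delimiter characters afterwards.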

Answer 3

Strings in .NET are UTF-16 encoded.

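Also, the Regex engine does not match against Unicode characters as a user perceives them. It matches against the individual UTF-16 code units of the string, which for characters in the Basic Multilingual Plane (such as U+2448) are the same as the code points. See this post.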
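A quick way to check this is to compare a BMP character with one outside the BMP. This is only a sketch, and the Utf16Demo class and MatchesSingleDot method are hypothetical names. U+1F600 is used purely as an example of a character that needs a surrogate pair.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Utf16Demo
{
    // True when the whole input is matched by a single '.' in a .NET regex.
    public static bool MatchesSingleDot(string input)
    {
        return Regex.IsMatch(input, "^.$");
    }

    public static void Main()
    {
        string bmp = "\u2448";        // one UTF-16 code unit
        string astral = "\U0001F600"; // surrogate pair, two code units

        Console.WriteLine(bmp.Length);               // 1
        Console.WriteLine(astral.Length);            // 2
        Console.WriteLine(MatchesSingleDot(bmp));    // True
        Console.WriteLine(MatchesSingleDot(astral)); // False
    }
}
```

For the OCR symbols in the question this makes no practical difference, since U+2448 and U+2446 each occupy a single char.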
