Question

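My problem is fairly complex, but it boils down to a simple example.

I am writing a custom query language in which users enter strings that I parse into LINQ expressions.

I would like to split the string on the * character, unless it is properly escaped.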

Input         Output                          Query Description
"*\\*"    --> { "*", "\\", "*" }       -- contains a '\'
"*\\\**"  --> { "*", "\\\*", "*" }     -- contains '\*'
"*\**"    --> { "*", "\*", "*" }       -- contains '*' (works now)

I don't mind Regex.Split returning empty strings, but I end up with this:

Regex.Split(@"*\\*", @"(?<!\\)(\*)")  --> {"", "*", "\\*"}

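As you can see, I tried a negative lookbehind, which works for all of my cases except this one. I also tried Regex.Escape, with no luck.

Clearly, the problem is that the lookbehind rejects any * preceded by \, so it also rejects the final * in \\*. In that case, though, \\ is itself an escape sequence (an escaped backslash), so the * that follows is not escaped.

A solution does not necessarily have to involve regular expressions.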
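
To make the failure concrete, here is a minimal sketch in Java (the language used later in this thread) that lists where the question's lookbehind pattern finds delimiters. The class and method names are made up for illustration; only the pattern comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // The question's pattern: a '*' that is not directly preceded by a backslash.
    // As a regex this is (?<!\\)\* ; the Java literal doubles every backslash.
    private static final Pattern NAIVE = Pattern.compile("(?<!\\\\)\\*");

    /** Returns the indices at which the naive pattern finds a delimiter. */
    static List<Integer> delimiterIndices(String input) {
        List<Integer> indices = new ArrayList<>();
        Matcher m = NAIVE.matcher(input);
        while (m.find()) {
            indices.add(m.start());
        }
        return indices;
    }

    public static void main(String[] args) {
        // Input characters: * \ \ *
        // The last '*' follows an escaped backslash, so it should be a delimiter,
        // but the lookbehind only sees the backslash right before it.
        System.out.println(delimiterIndices("*\\\\*")); // prints [0]

        // Input characters: * \ * *
        // Here the pattern behaves as intended: index 2 is escaped, index 3 is not.
        System.out.println(delimiterIndices("*\\**")); // prints [0, 3]
    }
}
```

The first call misses index 3, which is exactly the case the question describes.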

Answer 1

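I think matching is much easier than splitting, especially since you are not removing anything from the initial string. So what should you match? Everything except an unescaped *.

How? With the following regex: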

@"(?:[^*\\]+|\\.)+|\*"

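(?:[^*\\]+|\\.)+ matches a run of characters that are neither * nor \, together with any escaped characters (a \ followed by any character). No lookarounds are needed.

\* matches the separator.

In code: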

using System;
using System.Text.RegularExpressions;
using System.Linq;
public class Test
{
    public static void Main()
    {   
        string[] tests = new string[]{
            @"*\\*",
            @"*\\\**",
            @"*\**",
        };

        Regex re = new Regex(@"(?:[^*\\]+|\\.)+|\*");

        foreach (string s in tests) {
            var parts = re.Matches(s)
             .OfType<Match>()
             .Select(m => m.Value)
             .ToList();

            Console.WriteLine(string.Join(", ", parts.ToArray()));
        }
    }
}

Output:

*, \\, *
*, \\\*, *
*, \*, *
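
For readers not on .NET, the same pattern can be sketched with java.util.regex. This is an illustrative port, not part of the original answer; the class name EscapeTokenizer and the method tokenize are invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeTokenizer {
    // Same regex as the C# answer: (?:[^*\\]+|\\.)+|\*
    // Either a run of ordinary characters and escape pairs, or one unescaped '*'.
    private static final Pattern TOKEN =
            Pattern.compile("(?:[^*\\\\]+|\\\\.)+|\\*");

    /** Collects every match of the token pattern, in order. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each Java literal below doubles the backslashes of the raw input.
        String[] tests = { "*\\\\*", "*\\\\\\**", "*\\**" };
        for (String s : tests) {
            System.out.println(String.join(", ", tokenize(s)));
        }
    }
}
```

One behaviour worth knowing: because \\. requires a character after the backslash, a lone backslash at the very end of the input is not part of any match and is silently skipped.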


Answer 2

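I came up with this regex: (?<=(?:^|[^\\])(?:\\\\)*)(\*).

Explanation:

You simply enumerate what can come before an unescaped *, namely:

the start of the string: ^
a character that is not \: [^\\]
(the start of the string or a character that is not \), followed by an even number of \: (^|[^\\])(\\\\)*

Test code and examples: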
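
The "even number of backslashes" rule can also be checked without a regex, which may help in engines that lack variable-length lookbehind. The following Java sketch is an illustration under that assumption; EscapeParity and its methods are hypothetical names, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeParity {
    /**
     * Returns true if the character at index is preceded by an even number
     * (including zero) of consecutive backslashes, i.e. it is not escaped.
     */
    static boolean isUnescaped(String s, int index) {
        int backslashes = 0;
        for (int i = index - 1; i >= 0 && s.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 0;
    }

    /** Returns the indices of every unescaped '*' in the input. */
    static List<Integer> delimiterIndices(String s) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '*' && isUnescaped(s, i)) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // Input characters: * \ \ *   -> both stars are delimiters
        System.out.println(delimiterIndices("*\\\\*"));    // prints [0, 3]
        // Input characters: * \ \ \ * *   -> the star at index 4 is escaped
        System.out.println(delimiterIndices("*\\\\\\**")); // prints [0, 5]
    }
}
```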

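using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] tests = new string[]{
    @"*\\*",
    @"*\\\**",
    @"*\**",
    @"test\**test2",
};

// The lookbehind contains (?:\\\\)*, so it has variable length.
// .NET supports this; many other regex engines do not.
Regex re = new Regex(@"(?<=(?:^|[^\\])(?:\\\\)*)(\*)");

foreach (string s in tests) {
    // Split keeps the captured delimiter (\*) in the result array.
    string[] m = re.Split(s);
    // Drop the empty strings produced at the start and end of the input.
    Console.WriteLine(String.Format("{0,-20} {1}", s, String.Join(", ",
       m.Where(x => !String.IsNullOrEmpty(x)))));
}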

Results:

*\\*                 *, \\, *
*\\\**               *, \\\*, *
*\**                 *, \*, *
test\**test2         test\*, *, test2

Answer 3

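I think a pure parsing, non-regex solution would be a good addition to this question.

I can read it faster than I can understand any regex. It also makes fixing unexpected corner cases easy, because the logic is laid out directly.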

import java.util.ArrayList;
import java.util.List;

// Place this method inside any class.
public static String[] splitOnDelimiterWithEscape(String toSplit, char delimiter, char escape) {
    List<String> strings = new ArrayList<>();

    char[] chars = toSplit.toCharArray();
    StringBuilder sub = new StringBuilder();

    for (int i = 0; i < chars.length; i++) {
        if (chars[i] == escape) {
            // Append whatever character follows the escape, forcing it to be literal.
            // The escape character itself is dropped from the output.
            // An escape at the very end of the input is ignored.
            if (i + 1 < chars.length) {
                sub.append(chars[++i]);
            }

            // This is the simplest implementation of escaping. If escaping certain
            // characters should have special behaviour, implement it here.
            // You could even pass a Map from escaped characters to literal
            // characters to make this more general.

        } else if (chars[i] == delimiter) {
            strings.add(sub.toString()); // Found a delimiter, so we split.
            sub.setLength(0);
        } else {
            sub.append(chars[i]); // Nothing special; append to the current string.
        }
    }

    strings.add(sub.toString()); // The end of the string is a boundary and must be included.

    return strings.toArray(new String[0]);
}

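Update: I am now a bit confused by the question. As far as I know, a split does not include the delimiters (but your examples look as if it does). If you want the delimiters to appear in the array, in their own slots, the modification is fairly simple. (I will leave it as an exercise for the reader, as evidence of the code's maintainability.)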
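
One possible take on that exercise is sketched below. It is not the answerer's code, and it makes two assumptions that differ from the method above: escape sequences are kept verbatim (otherwise an escaped * would be indistinguishable from a delimiter token), and empty segments are skipped so the output matches the table in the question. The names DelimiterKeepingSplitter and splitKeepingDelimiters are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class DelimiterKeepingSplitter {
    /**
     * Splits on unescaped delimiters and emits each delimiter as its own token.
     * Assumptions (not from the original answer): escape sequences are copied
     * through unchanged, and empty segments are not emitted.
     */
    static List<String> splitKeepingDelimiters(String toSplit, char delimiter, char escape) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (int i = 0; i < toSplit.length(); i++) {
            char c = toSplit.charAt(i);
            if (c == escape) {
                // Keep the escape character and, if present, the character it escapes.
                current.append(c);
                if (i + 1 < toSplit.length()) {
                    current.append(toSplit.charAt(++i));
                }
            } else if (c == delimiter) {
                // Flush the pending segment, then record the delimiter itself.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(String.valueOf(delimiter));
            } else {
                current.append(c);
            }
        }

        // The end of the input closes the last segment.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input characters: test\**test2
        System.out.println(String.join(", ",
                splitKeepingDelimiters("test\\**test2", '*', '\\')));
    }
}
```

If the query layer needs the unescaped text instead, the escape characters could be dropped in the first branch, at the cost of needing some other way to tell a literal * from a delimiter.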
