Question

I have a regular expression that works perfectly.

^SENT KV(?<singlelinedata> L(?<line>[1-9]\d*) (?<measureline>\d+)(?: (?<samplingpoint>\d+))+)+$

My input string looks like this:

SENT KV L1 123 1 2 3 L2 456 4 5 6

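My only question is: how do I get the context of every capture of the "samplingpoint" group?

The group contains 6 captures, but I also need the context information: three of them belong to the first capture of "singlelinedata" and three to the second. How can I get this information?

A capture of a group has no property that lists the captures of the groups nested inside it.

I know I could write one regular expression to match the whole string and run a second one to parse each "singlelinedata" capture.

I am looking for a way that works with the regular expression given above.

I hope someone can help.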

Answer 1

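The regex API has no concept of "subgroups". A group can have multiple captures, but the API does not tell you which samplingpoint belongs to which line.

Your only option is to work it out yourself from the character indices.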
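
As a rough illustration of what working from the character indices can look like, the sketch below runs the pattern from the question and assigns each "samplingpoint" capture to the "singlelinedata" capture whose span contains it. The class and method names (`FlatCaptures`, `SamplingPointsPerLine`) are made up for this example; only the `Match`, `Group` and `Capture` members come from the .NET API.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FlatCaptures
{
    // Pattern copied from the question.
    private static readonly Regex Pattern = new Regex(
        @"^SENT KV(?<singlelinedata> L(?<line>[1-9]\d*) (?<measureline>\d+)(?: (?<samplingpoint>\d+))+)+$");

    // Hypothetical helper: one list of sampling points per "singlelinedata" capture.
    public static List<List<string>> SamplingPointsPerLine(string input)
    {
        var result = new List<List<string>>();
        Match m = Pattern.Match(input);
        if (!m.Success) return result;

        // All captures of "samplingpoint" sit in one flat collection
        // (6 entries for the sample input in the question).
        var points = m.Groups["samplingpoint"].Captures.Cast<Capture>().ToList();

        foreach (Capture outer in m.Groups["singlelinedata"].Captures)
        {
            int start = outer.Index;
            int end = outer.Index + outer.Length; // exclusive upper bound

            // Index and Length are the only link between an inner capture
            // and the outer capture that encloses it.
            result.Add(points
                .Where(p => p.Index >= start && p.Index + p.Length <= end)
                .Select(p => p.Value)
                .ToList());
        }
        return result;
    }
}
```

For the input from the question, `FlatCaptures.SamplingPointsPerLine("SENT KV L1 123 1 2 3 L2 456 4 5 6")` should return two lists: 1, 2, 3 and 4, 5, 6.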

Answer 2

// LINQPad script: Dump() is LINQPad's output helper.
void Main()
{
    string data = @"SENT KV L1 123 1 2 3 L2 456 4 5 6";
    Parse(data).Dump();
}

public class Result
{
    public int Line;
    public int MeasureLine;
    public List<int> SamplingPoints;
}

private Regex pattern = new Regex(@"^SENT KV(?<singlelinedata> L(?<line>[1-9]\d*) (?<measureline>\d+)(?: (?<samplingpoint>\d+))+)+$", RegexOptions.Multiline);

public IEnumerable<Result> Parse(string data)
{
    foreach (Match m in pattern.Matches(data))
    {
        foreach (Capture c1 in m.Groups["singlelinedata"].Captures)
        {
            // c1 is one " L<n> ..." record; the inner groups below are filtered by their position within it.

            var result = new Result();
            result.Line = int.Parse(m.Groups["line"].CapturesWithin(c1).First().Value);
            result.MeasureLine = int.Parse(m.Groups["measureline"].CapturesWithin(c1).First().Value);

            result.SamplingPoints = new List<int>();
            foreach (Capture c2 in m.Groups["samplingpoint"].CapturesWithin(c1))
            {
                result.SamplingPoints.Add(int.Parse(c2.Value));
            }

            yield return result;
        }
    }
}

public static class RegexExtensions
{
    public static IEnumerable<Capture> CapturesWithin(this Group group, Capture capture)
    {
        foreach (Capture c in group.Captures)
        {
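            // Captures are listed in the order they were matched (left to right here),
            // so skip the ones that start before the outer capture...
            if (c.Index < capture.Index) continue;
            // ...and stop at the first one that starts past its end.
            if (c.Index >= capture.Index + capture.Length) break;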

            yield return c;
        }
    }
}

Edit: rewritten as an extension method on Group.

Answer 3

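One way to keep a single regular expression without a lot of index matching is to give all the capture groups the same name. The nested captures are actually pushed onto the stack first, so you end up with an array like this: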

["1", "123", "1", "2", "3", " L1 123 1 2 3", "2", "456", "4", "5", "6", " L2 456 4 5 6"]

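Then it only takes a bit of LINQ craziness to split the list into groups wherever a capture containing an L is found, and to pull the data out of each group.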

var regex = new Regex(@"^SENT KV(?<singlelinedata> L(?<singlelinedata>[1-9]\d*) (?<singlelinedata>\d+)(?: (?<singlelinedata>\d+))+)+$");
var matches = regex.Matches("SENT KV L1 123 1 2 3 L2 456 4 5 6 12 13 L3 789 7 8 9 10");
var singlelinedata = matches[0].Groups["singlelinedata"];

string groupKey = null;
var result = singlelinedata.Captures.OfType<Capture>()
    .Reverse()
    .GroupBy(key => groupKey = key.Value.Contains("L") ? key.Value : groupKey, value => value.Value)
    .Reverse()
    .Select(group => new { key = group.Key, data = group.Skip(1).Reverse().ToList() })
    .Select(item => new { line = item.data.First(), measureline = item.data.Skip(1).First(), samplingpoints = item.data.Skip(2).ToList() })
    .ToList();
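
For comparison, the same split can be written as a plain loop instead of the chained LINQ calls. This is only a sketch, and it relies on the capture order described above (inner captures first, the enclosing capture containing "L" last). The `SameNameCaptures` class, the `Parse` method and the tuple field names are invented for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SameNameCaptures
{
    // Same pattern as in the answer: every group shares the name "singlelinedata".
    private static readonly Regex Pattern = new Regex(
        @"^SENT KV(?<singlelinedata> L(?<singlelinedata>[1-9]\d*) (?<singlelinedata>\d+)(?: (?<singlelinedata>\d+))+)+$");

    // Hypothetical helper: one (line, measureline, sampling points) record per "L..." block.
    public static List<(string Line, string MeasureLine, List<string> SamplingPoints)> Parse(string input)
    {
        var results = new List<(string Line, string MeasureLine, List<string> SamplingPoints)>();
        Match m = Pattern.Match(input);
        if (!m.Success) return results;

        // Values collected since the last enclosing capture.
        var pending = new List<string>();
        foreach (Capture c in m.Groups["singlelinedata"].Captures)
        {
            if (c.Value.Contains("L"))
            {
                // The enclosing capture arrives after its inner captures and closes the record:
                // pending[0] is the line, pending[1] the measureline, the rest are sampling points.
                results.Add((pending[0], pending[1], pending.Skip(2).ToList()));
                pending = new List<string>();
            }
            else
            {
                pending.Add(c.Value);
            }
        }
        return results;
    }
}
```

With the input used in the answer, `SameNameCaptures.Parse(...)` should produce three records; the second one, for example, has line 2, measureline 456 and sampling points 4, 5, 6, 12, 13.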

Answer 4

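Based on Markus Jarderot's answer, I wrote an extension method for Group that takes a capture and returns all captures of that group that lie within the given capture.

The extension method looks like this: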

    public static IEnumerable<Capture> CapturesWithin(this Group source, Capture captureContainingGroup)
    {
        var lowerIndex = captureContainingGroup.Index;
        var upperIndex = lowerIndex + captureContainingGroup.Length - 1;

        foreach (var capture in source.Captures.Cast<Capture>())
        {
            if (capture.Index < lowerIndex)
            {
                continue;
            }

            if (capture.Index > upperIndex)
            {
                break;
            }

            yield return capture;
        }
    }

Using this method:

foreach (var capture in match.Groups["singlelinedata"].Captures.Cast<Capture>())
{
    var samplingpoints = match.Groups["samplingpoint"].CapturesWithin(capture).ToList();
    ...

Regex: multiple captures within multiple captures
