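Matching source code keywords

Date: 2013-07-11 19:44:29

Tags: c# regex

I want to match keywords in C# source code with a regular expression. Say I have the "new" keyword. I want to match every "new" keyword that is not inside "" (a string literal), // (a line comment), or /* */ (a block comment).

So far I have written: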

\b[^\w@]new\b

However, it does not work for:

new[]
var a = new[] { "bla" };
var string = "new"
foo(); // new
/* new */

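How can I improve this regex?

2 answers:

Answer 0 (score: 2)

Description

It is easier to capture all of the unwanted matches along with the wanted ones. Then, in your program logic, test which capture group was populated: if it is the keyword group, the match is one you want.

This expression will:

  • skip all single- and double-quoted blocks of text, such as "new" and 'new'
  • skip all block comment sections, such as /* new */
  • skip all single-line comments, such as // new
  • capture any keyword that is not quoted or commented, such as new, var, and foo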

(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|'[^']*')|\b(new|var|foo)\b|(\w+)

Example

I don't know C#, so I offer a PowerShell example to demonstrate how this can be accomplished. I made the expression case-insensitive and turned on "dot matches new line" with (?is), and I had to escape every single quote in the expression as ''.

Code

$String = 'NEW[]
var a = NEw[] { "bla" };
var string = "new"
foo(); // new
/*
new
*/
'
clear

[regex]$Regex = '(?is)(\/\*(?:(?!\*\/).)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|''[^'']*'')|\b(new|var|foo)\b|(\w+)'

# cycle through all matches
$Regex.matches($String) | foreach {

    # Capture group 1 collects the comments, if populated then this match is a comment
    if ($_.Groups[1].Value) {
        Write-Host "comment at " $_.Groups[1].index " with a value => " $_.Groups[1].Value
        } # end if

    # capture group 2 collects the quoted strings, if populated then this match is a quoted string
    if ($_.Groups[2].Value) {
        Write-Host "quoted string at " $_.Groups[2].index " with a value => " $_.Groups[2].Value
        } # end if

    # capture group 3 collects keywords like new, var, and foo, if populated then this match is a keyword
    if ($_.Groups[3].Value) {
        Write-Host "keyword at " $_.Groups[3].index " with a value => " $_.Groups[3].Value
        } # end if

    # capture group 4 collects all the other word character chunks, so these might be variable names
    if ($_.Groups[4].Value) {
        Write-Host "possible variable name at " $_.Groups[4].index " with a value => " $_.Groups[4].Value
        } # end if

    } # next match

Output

keyword at  0  with a value =>  NEW
keyword at  7  with a value =>  var
possible variable name at  11  with a value =>  a
keyword at  15  with a value =>  NEw
quoted string at  23  with a value =>  "bla"
keyword at  33  with a value =>  var
possible variable name at  37  with a value =>  string
quoted string at  46  with a value =>  "new"
keyword at  53  with a value =>  foo
comment at  60  with a value =>  // new

comment at  68  with a value =>  /*
new
*/
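
Since the question is about C#, the same capture-group idea can be sketched there too. This is only a rough port, not the answer's tested code: the named groups (comment, quoted, keyword, word) and the decision to accept a // comment without a trailing newline are assumptions of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Comments and quoted text come first in the alternation, so they swallow
// any keyword they contain before the keyword branch gets a chance.
var pattern = new Regex(
    @"(?<comment>/\*.*?\*/|//[^\r\n]*)" +
    @"|(?<quoted>""[^""]*""|'[^']*')" +
    @"|\b(?<keyword>new|var|foo)\b" +
    @"|(?<word>\w+)",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

string source =
    "NEW[]\nvar a = NEw[] { \"bla\" };\nvar string = \"new\"\nfoo(); // new\n/*\nnew\n*/\n";

// Keep only the matches where the keyword group took part in the match.
var keywords = new List<string>();
foreach (Match m in pattern.Matches(source))
{
    if (m.Groups["keyword"].Success)
    {
        keywords.Add(m.Value);
        Console.WriteLine($"keyword at {m.Index}: {m.Value}");
    }
}
```

On the sample text this should report NEW, var, NEw, var, and foo, and nothing from inside the string literal or the two comments.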

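Answer 1 (score: 1)

Simple: use a lexer. A lexer finds groups of text in a string and generates tokens from those groups. Each token is then given a "type" (something that identifies what it is).

A C# keyword is one of the defined C# keywords. A simple regex would define a word boundary followed by one of the possible C# keywords. ("\b(new|var|string|...)\b")

Your lexer will find all of the matches for keywords in a given string, create a token for each match, and say that the token "type" is "keyword".

However, as you say, you do not want to find keywords inside quotes or comments. This is where the lexer really earns its points.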
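
To make the problem concrete, here is a small sketch of what a keyword-only regex reports on its own. The sample input is the one used later in this answer; the variable name naive is just illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";

// A keyword-only regex has no notion of context, so it also reports
// the words inside the string literal and inside the trailing comment.
var naive = Regex.Matches(input, @"\b(?:new|var|string)\b")
    .Cast<Match>()
    .Select(m => (m.Index, m.Value))
    .ToList();

foreach (var (index, value) in naive)
    Console.WriteLine($"{index}: {value}");
```

Five matches come back, but only the first one (string at index 0) is a real keyword; the other four sit inside the string literal or the comment.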

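To resolve this, a (regex-based) lexer uses two methods:

  1. Remove all matches that are contained by another match.
  2. Remove matches that occupy the same space but have a lower priority.

The lexer works in the following steps:

  1. Find all of the matches from the regexes.
  2. Convert them to tokens.
  3. Order the tokens by index.
  4. Loop through each of the tokens, comparing the current match with the next match; if the next match is partially contained by this match (or if they both occupy the same space), remove it.

Spoiler alert: below is a fully functional lexer. It demonstrates how a lexer works.

Example:

Given regexes for strings, comments, and keywords, show how a lexer resolves conflicts between them.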

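      using System;
      using System.Collections.Generic;
      using System.Linq;
      using System.Text.RegularExpressions;
      
      //Simple Regex for strings
      string StringRegex = "\"(?:[^\"\\\\]|\\\\.)*\"";
      
      //Simple Regex for comments
      //(the lazy *? stops at the first */ instead of running on to the last one)
      string CommentRegex = @"//.*|/\*[\s\S]*?\*/";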
      
      //Simple Regex for keywords
      string KeywordRegex = @"\b(?:new|var|string)\b";
      
      //Create a dictionary relating token types to regexes
      Dictionary<string, string> Regexes = new Dictionary<string, string>()
      {
          {"String", StringRegex},
          {"Comment", CommentRegex},
          {"Keyword", KeywordRegex}
      };
      
      //Define a string to tokenize
      string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";
      
      
      //Lexer steps:
      //1). Find all of the matches from the regexes
      //2). Convert them to tokens
      //3). Order the tokens by index
      //4). Loop through each of the tokens comparing
      //    the current match with the next match,
      //    if the next match is partially contained by this match
      //    (or if they both occupy the same space) remove it.
      
      
      //** Sorry for the complex LINQ expression (not really) **
      
      //Match each regex to the input string(Step 1)
      var matches = Regexes.SelectMany(a => Regex.Matches(input, a.Value)
      //Cast each match because MatchCollection does not implement IEnumerable<T>
      .Cast<Match>()
      //Select a new token for each match(Step 2)
      .Select(b => 
              new
              {
                  Index = b.Index,
                  Value = b.Value,
                  Type = a.Key //Type is based on the current regex.
              }))
      //Order each token by the index (Step 3)
      .OrderBy(a => a.Index).ToList();
      
      //Loop through the tokens(Step 4)
      for (int i = 0; i < matches.Count; i++)
      {
          //Compare the current token with the next token to see if it is contained
          if (i + 1 < matches.Count)
          {
              int firstEndPos = (matches[i].Index + matches[i].Value.Length);
              if (firstEndPos > matches[(i + 1)].Index)
              {
                  //Remove the next token from the list and stay at
                  //the current match
                  matches.RemoveAt(i + 1);
                  i--;
              }
          }
      }
      
      //Now matches contains all of the right matches
      //Filter the matches by the Type to single out keywords from comments and
      //string literals.
      foreach(var match in matches)
      {
          Console.WriteLine(match);
      }
      Console.ReadLine();
      

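This is a working (I tested it), almost complete lexer. Feel free to use it or write your own. It will find all of the keywords you define in the regex without confusing them with string literals or comments.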
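
The code above orders tokens by index only. The second method (dropping the lower-priority match when two matches occupy the same space) could be sketched roughly as follows. The rule order, the extra Identifier rule, the Tokenize helper, and the sample input are all assumptions made for illustration, not part of the lexer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Rules listed from highest to lowest priority (this order is an assumption).
var rules = new (string Type, string Pattern)[]
{
    ("Comment",    @"//.*|/\*[\s\S]*?\*/"),
    ("String",     "\"(?:[^\"\\\\]|\\\\.)*\""),
    ("Keyword",    @"\b(?:new|var|string)\b"),
    ("Identifier", @"\b[A-Za-z_]\w*\b"),
};

List<(int Index, string Value, string Type)> Tokenize(string input)
{
    // Collect every match of every rule, then sort by position;
    // ties at the same position are broken by rule priority.
    var candidates = rules
        .SelectMany((rule, priority) => Regex.Matches(input, rule.Pattern)
            .Cast<Match>()
            .Select(m => (m.Index, m.Value, rule.Type, Priority: priority)))
        .OrderBy(t => t.Index)
        .ThenBy(t => t.Priority)
        .ToList();

    var tokens = new List<(int Index, string Value, string Type)>();
    int end = 0; // first position not covered by an accepted token
    foreach (var c in candidates)
    {
        // Skip anything that starts inside a token already accepted.
        if (c.Index < end) continue;
        tokens.Add((c.Index, c.Value, c.Type));
        end = c.Index + c.Value.Length;
    }
    return tokens;
}

var result = Tokenize("var s = \"new\"; // new string\nvar t = new string('x', 3);");
foreach (var token in result.Where(x => x.Type == "Keyword"))
    Console.WriteLine($"{token.Index}: {token.Value}");
```

Here var matches both the Keyword and the Identifier rule at the same position; because Keyword is listed first, its candidate sorts ahead and the Identifier candidate is skipped. The keywords reported for the sample input should be var, var, new, and string.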