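I want to match keywords in C# source code with a regular expression. Say I have the "new" keyword. I want to match every "new" keyword that is not inside "" (a string literal), // (a line comment), or /* */ (a block comment).
So far I have written: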
\b[^\w@]new\b
However, it does not work for:
new[]
var a = new[] { "bla" };
var string = "new"
foo(); // new
/* new */
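How can I improve this regex?
Answer 0 (score: 2)
It would be easier to capture all the bad matches along with all the good ones. Then, in your program logic, test which capture group was populated; if it is the keyword group, the match is one you want.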
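This expression will skip quoted text such as "new" or 'new', skip comments such as /* new */ and // new, and capture unquoted, uncommented keywords such as new, var, and foo: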
(\/\*(?:(?!\*\/)|.)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|'[^']*')|(new|var|foo)|(\w+)
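I don't know C#, so I am providing a PowerShell example to demonstrate how to accomplish this. I made the expression case-insensitive and turned on "dot matches newline" with (?is), and I had to escape every single quote in the expression as ''.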
Code
$String = 'NEW[]
var a = NEw[] { "bla" };
var string = "new"
foo(); // new
/*
new
*/
'
clear
[regex]$Regex = '(?is)(\/\*(?:(?!\*\/)|.)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|''[^'']*'')|(new|var|foo)|(\w+)'
# cycle through all matches
$Regex.matches($String) | foreach {
    # Capture group 1 collects the comments, if populated then this match is a comment
    if ($_.Groups[1].Value) {
        Write-Host "comment at " $_.Groups[1].index " with a value => " $_.Groups[1].Value
    } # end if
    # capture group 2 collects the quoted strings, if populated then this match is a quoted string
    if ($_.Groups[2].Value) {
        Write-Host "quoted string at " $_.Groups[2].index " with a value => " $_.Groups[2].Value
    } # end if
    # capture group 3 collects keywords like new, var, and foo, if populated then this match is a keyword
    if ($_.Groups[3].Value) {
        Write-Host "keyword at " $_.Groups[3].index " with a value => " $_.Groups[3].Value
    } # end if
    # capture group 4 collects all the other word character chunks, so these might be variable names
    if ($_.Groups[4].Value) {
        Write-Host "possible variable name at " $_.Groups[4].index " with a value => " $_.Groups[4].Value
    } # end if
} # next match
Output
keyword at 0 with a value => NEW
keyword at 7 with a value => var
possible variable name at 11 with a value => a
keyword at 15 with a value => NEw
quoted string at 23 with a value => "bla"
keyword at 33 with a value => var
possible variable name at 37 with a value => string
quoted string at 46 with a value => "new"
keyword at 53 with a value => foo
comment at 60 with a value => // new
comment at 68 with a value => /*
new
*/
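Since the question is about C#, here is a rough sketch of how the same capture-group idea might look there. The names KeywordFinder and FindKeywords are made up for illustration, and the pattern is a tightened variant rather than the exact expression above: it is an assumption on my part to add \b around the keywords (so "newest" is not reported), to allow escaped quotes inside quoted text, to let a line comment end at the end of input, and to match case-sensitively because C# keywords are case-sensitive. It does not handle verbatim (@"...") strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeywordFinder
{
    // Alternation order matters: comments and quoted text are tried first,
    // so a keyword inside them is consumed before group 3 can see it.
    private static readonly Regex Pattern = new Regex(
        @"(/\*.*?\*/|//[^\r\n]*)" +                     // group 1: block or line comment
        @"|(""(?:[^""\\]|\\.)*""|'(?:[^'\\]|\\.)*')" +  // group 2: quoted text
        @"|\b(new|var|foo)\b" +                         // group 3: keywords of interest
        @"|(\w+)",                                      // group 4: any other word
        RegexOptions.Singleline);

    // Returns the index and text of every keyword found outside comments and quotes.
    public static List<(int Index, string Value)> FindKeywords(string source)
    {
        var found = new List<(int Index, string Value)>();
        foreach (Match m in Pattern.Matches(source))
        {
            // Only group 3 holds the matches we want; the other groups exist
            // to swallow text we want to ignore.
            if (m.Groups[3].Success)
            {
                found.Add((m.Groups[3].Index, m.Groups[3].Value));
            }
        }
        return found;
    }
}
```

Calling KeywordFinder.FindKeywords on the sample text from the question should report only the occurrences of new, var, and foo that sit outside the string literals and comments.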
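Answer 1 (score: 1)
Simple: use a lexer. A lexer finds groups of text in a string and generates tokens from those groups. Each token is then given a "type" (something that identifies what it is).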
A C# keyword is one of the defined C# keywords.
A simple regex would define a word boundary followed by one of the possible C# keywords ("\b(new|var|string|...)\b").
Your lexer will find all matches of keywords in a given string, create a token for each match, and set the token's "type" to "keyword".
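As a rough sketch of that naive step (the KeywordToken and NaiveKeywordLexer names are hypothetical, and only three keywords are listed to keep it short):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical token type: where the text was found, what it was, and its type.
public sealed class KeywordToken
{
    public int Index { get; }
    public string Value { get; }
    public string Type { get; }

    public KeywordToken(int index, string value, string type)
    {
        Index = index;
        Value = value;
        Type = type;
    }
}

public static class NaiveKeywordLexer
{
    // Word boundary, one of the keywords, word boundary.
    private static readonly Regex Keyword = new Regex(@"\b(?:new|var|string)\b");

    // Creates one "keyword" token per regex match, with no regard for context.
    public static List<KeywordToken> Tokenize(string input)
    {
        var tokens = new List<KeywordToken>();
        foreach (Match m in Keyword.Matches(input))
        {
            tokens.Add(new KeywordToken(m.Index, m.Value, "keyword"));
        }
        return tokens;
    }
}
```

For an input like var s = "new"; // new this produces three tokens: var, plus both occurrences of new, even though one is inside a string literal and the other inside a comment.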
However, as you said, you don't want to find keywords inside quotes or comments. This is where a lexer really earns its points.
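To solve this, a (regex-based) lexer uses two methods: it removes any match that is partially or fully contained in an earlier match, and when two matches occupy the same space it keeps the one with the higher priority.
A lexer works in the following steps: find all of the matches from the regexes; convert them to tokens; order the tokens by index, then priority; loop through the tokens, comparing each match with the next one and removing the next one if it is partially contained by the current match (or if both occupy the same space).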
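Spoiler alert: below is a fully functional lexer that demonstrates how this works.
For example: given regexes for strings, comments, and keywords, it shows how a lexer resolves the conflicts between them.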
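using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
//Simple Regex for strings
string StringRegex = "\"(?:[^\"\\\\]|\\\\.)*\"";
//Simple Regex for comments (lazy quantifier so a block comment stops at the first */)
string CommentRegex = @"//.*|/\*[\s\S]*?\*/";
//Simple Regex for keywords
string KeywordRegex = @"\b(?:new|var|string)\b";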
//Create a dictionary relating token types to regexes
Dictionary<string, string> Regexes = new Dictionary<string, string>()
{
    {"String", StringRegex},
    {"Comment", CommentRegex},
    {"Keyword", KeywordRegex}
};
//Define a string to tokenize
string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";
//Lexer steps:
//1). Find all of the matches from the regexes
//2). Convert them to tokens
//3). Order the tokens by index then priority
//4). Loop through each of the tokens comparing
// the current match with the next match,
// if the next match is partially contained by this match
// (or if they both occupy the same space) remove it.
//** Sorry for the complex LINQ expression (not really) **
//Match each regex to the input string(Step 1)
var matches = Regexes.SelectMany(a => Regex.Matches(input, a.Value)
    //Cast each match because MatchCollection does not implement IEnumerable<T>
    .Cast<Match>()
    //Select a new token for each match(Step 2)
    .Select(b =>
        new
        {
            Index = b.Index,
            Value = b.Value,
            Type = a.Key //Type is based on the current regex.
        }))
    //Order each token by the index (Step 3)
    .OrderBy(a => a.Index).ToList();
//Loop through the tokens(Step 4)
for (int i = 0; i < matches.Count; i++)
{
    //Compare the current token with the next token to see if it is contained
    if (i + 1 < matches.Count)
    {
        int firstEndPos = (matches[i].Index + matches[i].Value.Length);
        if (firstEndPos > matches[(i + 1)].Index)
        {
            //Remove the next token from the list and stay at
            //the current match
            matches.RemoveAt(i + 1);
            i--;
        }
    }
}
//Now matches contains all of the right matches
//Filter the matches by the Type to single out keywords from comments and
//string literals.
foreach(var match in matches)
{
    Console.WriteLine(match);
}
Console.ReadLine();
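This is a working (I tested it), nearly complete lexer. (Feel free to use it or write your own.) It will find all of the keywords you define in the regex without confusing them with string literals or comments.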
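The step list mentions ordering by index "then priority", while the code above sorts by index only. Below is a sketch of one way to make the priority explicit and to filter out just the keywords at the end. The PriorityLexer name, the Rules table (earlier entries win ties), and the Keywords helper are all assumptions for illustration, not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PriorityLexer
{
    // Hypothetical rule table: earlier entries have higher priority.
    private static readonly (string Type, Regex Pattern)[] Rules =
    {
        ("String",  new Regex(@"""(?:[^""\\]|\\.)*""")),
        ("Comment", new Regex(@"//.*|/\*[\s\S]*?\*/")),
        ("Keyword", new Regex(@"\b(?:new|var|string)\b")),
    };

    public static List<(int Index, string Value, string Type)> Tokenize(string input)
    {
        // Steps 1-3: collect every match of every rule, then sort by start
        // index and, for equal indexes, by the rule's position in the table.
        var candidates = Rules
            .SelectMany((rule, priority) => rule.Pattern.Matches(input)
                .Cast<Match>()
                .Select(m => (Index: m.Index, Value: m.Value,
                              Type: rule.Type, Priority: priority)))
            .OrderBy(t => t.Index)
            .ThenBy(t => t.Priority)
            .ToList();

        // Step 4: keep a candidate only if it starts at or after the end of
        // the last token kept; anything overlapping an earlier token is dropped.
        var result = new List<(int Index, string Value, string Type)>();
        int end = 0;
        foreach (var t in candidates)
        {
            if (t.Index < end) continue;
            result.Add((t.Index, t.Value, t.Type));
            end = t.Index + t.Value.Length;
        }
        return result;
    }

    // Filters the surviving tokens down to the keywords.
    public static IEnumerable<string> Keywords(string input) =>
        Tokenize(input).Where(t => t.Type == "Keyword").Select(t => t.Value);
}
```

One limitation worth keeping in mind: because each regex is matched over the whole input independently, a stray quote inside a comment can pair up with the opening quote of a later, real string literal. That bogus candidate is then dropped for overlapping the comment, and the real literal is never tokenized. Trying all alternatives at each position in a single pass, as in Answer 0, avoids this.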