Question

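My grammar produces unexpected results. I'm not sure whether this is simply my mistake or a problem with how ANTLR handles ambiguous alternatives.

Here is my grammar: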

grammar PPMacro;
options {
  language=Java;
  backtrack=true;

}

file: (inputLines)+ EOF;

inputLines 
:  ( preprocessorLineSet  |  oneNormalInputLine )  ; 

oneNormalInputLine  @after{System.out.print("["+$text+"]");}  
: (any_token_except_crlf)* CRLF ;

preprocessorLineSet 
: ifPart endifLine;

ifPart: ifLine  inputLines*   ;
ifLine  @after{System.out.print("{"+$text+"}" );} 
:  '#' IF (any_token_except_crlf)* CRLF ;

endifLine @after{System.out.print("{"+$text+"}" );} 
:  '#' ENDIF (any_token_except_crlf)* CRLF ;

any_token_except_crlf: (ANY_ID | WS | '#'|IF|ENDIF);
// just matches everything

CRLF: '\r'?  '\n'  ;
WS: (' '|'\t'|'\f' )+;
Hash: '#'  ;
IF     : 'if'    ;
ENDIF  : 'endif' ;
ANY_ID: ( 'a'..'z'|'A'..'Z'|'0'..'9'| '_')+ ;

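Explanation:

The grammar is meant to parse C++ #if ... #endif blocks.

I am trying to recognize nested #if ... #endif blocks. This is done by my preprocessorLineSet rule, which has a recursive definition to support nested blocks. oneNormalInputLine is meant to handle anything that is not of the #if form. That rule is a match-anything rule, so it does in fact match #if lines as well, but I deliberately placed it after preprocessorLineSet in inputLines. I expected this ordering to prevent it from matching #if or #endif lines. The reason for the catch-all rule is that I want it to accept any other C++ syntax and simply echo it to the output.

In my test I simply print everything out. Lines matched by preprocessorLineSet should be surrounded by {}, and lines matched by oneNormalInputLine should be surrounded by [].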

Sample input:

#if s
s
#if a
s 
s
#endif
#endif

and this one:

#if
abc
#endif

Corresponding output:

[#if s
][s
][#if a
][s
][s
][#endif
][#endif
]

and this one:

[#if
][abc
][#endif
]

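Problem:

All output lines, including the #if and #endif lines, are surrounded by [], which means they were matched only by oneNormalInputLine! I did not expect this. preprocessorLineSet should be able to match the #if lines. Why do I get this result?

This line contains ambiguous alternatives: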

inputLines  :  ( preprocessorLineSet  |  oneNormalInputLine );

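because both alternatives can match #if and #endif. But I expected the first alternative to be used rather than the second. Also note that backtracking is turned on.

EDIT: The reason my oneNormalInputLine rule accepts everything is that it is hard to express "anything that does not have a certain pattern", since the #if pattern can be quite complex: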

/***

comments

*/   # /***
comments
*/ if

is a valid pattern. Writing a rule that matches everything except this pattern seems hard.

Answer 1

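Your approach isn't very robust. I suggest you keep it simple and use the actual language rule, which says that every line starting with # is a preprocessor directive and every line that doesn't start with # isn't. A grammar using this rule has no ambiguities and is much simpler to understand.

Why doesn't your grammar work? The problem is that your preprocessorLineSet rule can never match anything.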

preprocessorLineSet 
: ifPart endifLine;

ifPart: ifLine  inputLines*   ;

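It starts with #if ..., should then match further lines, and when the first matching #endif arrives it should match that and finish. However, that is not what actually happens. inputLines can match almost any line (almost, because it does not match, e.g., C++ operators and other non-identifier tokens), including all preprocessor directives. This means the ifPart rule matches all the way to the end of the input and leaves nothing for endifLine. Note that backtracking has no effect on this: once ANTLR has matched a rule (here ifPart, which succeeds over the entire rest of the input because * is greedy), it never backtracks into it. ANTLR's backtracking rules are quite hairy...

Note that if you make oneNormalInputLine not match preprocessor directives (e.g., make it something like (nonHash any* | ) CRLF), it will start working.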
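The greedy-loop behaviour described above can be illustrated with a small hand-written model. This is only a sketch and not ANTLR itself: it assumes each "token" is a whole line, it recognizes directives by a simple prefix check, and the class name GreedyLoopDemo and the flag normalRejectsHash are made up for illustration. With the flag off, the normal-line rule is the catch-all from the question; with it on, normal lines may not start with #, as suggested above.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the questioner's rules; each "token" is a whole line.
// It is NOT ANTLR: it only mimics a greedy loop that never gives lines back.
class GreedyLoopDemo {
    private final List<String> lines;
    private final boolean normalRejectsHash; // false = catch-all rule from the question

    GreedyLoopDemo(boolean normalRejectsHash, List<String> lines) {
        this.normalRejectsHash = normalRejectsHash;
        this.lines = lines;
    }

    private boolean startsWith(int pos, String prefix) {
        return pos < lines.size() && lines.get(pos).trim().startsWith(prefix);
    }

    // oneNormalInputLine: any line, or (fixed variant) any line not starting with '#'.
    // Returns the position after the match, or -1 on failure.
    private int oneNormalInputLine(int pos) {
        if (pos >= lines.size()) return -1;
        if (normalRejectsHash && startsWith(pos, "#")) return -1;
        return pos + 1;
    }

    // inputLines: preprocessorLineSet | oneNormalInputLine (first alternative preferred).
    private int inputLines(int pos) {
        int end = preprocessorLineSet(pos);
        return end >= 0 ? end : oneNormalInputLine(pos);
    }

    // preprocessorLineSet: ifLine inputLines* endifLine
    private int preprocessorLineSet(int pos) {
        if (!startsWith(pos, "#if")) return -1;
        int p = pos + 1;
        // Greedy loop: consume for as long as inputLines matches, never retry with fewer lines.
        for (int next = inputLines(p); next >= 0; next = inputLines(p)) {
            p = next;
        }
        return startsWith(p, "#endif") ? p + 1 : -1;
    }

    // Emits one inputLines match starting at pos and returns the position after it.
    private int emit(int pos, StringBuilder out) {
        int end = preprocessorLineSet(pos);
        if (end >= 0) {
            out.append('{').append(lines.get(pos)).append('}');
            int p = pos + 1;
            while (p < end - 1) p = emit(p, out);
            out.append('{').append(lines.get(end - 1)).append('}');
            return end;
        }
        // '!' marks a line that neither alternative accepts (a parse error in this model).
        boolean ok = oneNormalInputLine(pos) >= 0;
        out.append(ok ? '[' : '!').append(lines.get(pos)).append(ok ? ']' : '!');
        return pos + 1;
    }

    static String render(boolean normalRejectsHash, String... input) {
        GreedyLoopDemo demo = new GreedyLoopDemo(normalRejectsHash, Arrays.asList(input));
        StringBuilder out = new StringBuilder();
        for (int pos = 0; pos < input.length; ) pos = demo.emit(pos, out);
        return out.toString();
    }

    public static void main(String[] args) {
        String[] sample = {"#if s", "s", "#if a", "s", "s", "#endif", "#endif"};
        System.out.println(render(false, sample)); // [#if s][s][#if a][s][s][#endif][#endif]
        System.out.println(render(true, sample));  // {#if s}[s]{#if a}[s][s]{#endif}{#endif}
    }
}
```

With the catch-all rule, the loop inside preprocessorLineSet swallows the #endif lines too, so the rule always fails and every line falls through to the second alternative. Once normal lines reject #, the loop stops in front of #endif and the block is recognized.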

Answer 2

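Your any_token_except_crlf rule causes the ambiguity. You need to fix this by letting that rule match only the following (and remove backtrack=true;):

whitespace characters;
a '#' followed by anything other than 'if', 'endif' and a line break;
any character other than '#' and a line break, followed by 'if' or 'endif';
identifiers.

A small working example (I named the rules a bit differently...):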

grammar PPMacro;

options {
  output=AST;
}

tokens {
  FILE;
}

file
  :  line+ EOF -> ^(FILE line+)
  ;

line
  :  if_stat
  |  normal_line
  ;

if_stat
  :  HASH IF normal_line line* HASH ENDIF -> ^(IF normal_line line*)
  ;

normal_line
  :  non_special* CRLF -> non_special*
  ;

non_special
  :  WS
  |  HASH ~(IF | ENDIF | CRLF)
  |  ~(HASH | CRLF) (IF | ENDIF)
  |  ID
  ;

CRLF  : '\r'?  '\n'  ;
WS    : (' ' | '\t' | '\f')+;
HASH  : '#'  ;
IF    : 'if'    ;
ENDIF : 'endif' ;
ID    : ( 'a'..'z'|'A'..'Z'|'0'..'9'| '_')+ ;

which can be tested with the class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PPMacroLexer lexer = new PPMacroLexer(new ANTLRFileStream("test.cpp"));
    PPMacroParser parser = new PPMacroParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.file().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

and the test.cpp file might look like this:

a b
#if s
t
#if a
u 
v
#endif
#endif
c
d

which produces the following AST:

[image: AST of the input above]

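EDIT

I just saw that you also want to account for multi-line comments and whitespace between # and if (and endif). You can handle such things in the lexer, like this: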

grammar PPMacro;

options {
  output=AST;
}

tokens {
  FILE;
  ENDIF;
}

file
  :  line+ EOF -> ^(FILE line+)
  ;

line
  :  if_stat
  |  normal_line
  ;

if_stat
  :  IF normal_line line* ENDIF -> ^(IF normal_line line*)
  ;

normal_line
  :  non_special* CRLF -> non_special*
  ;

non_special
  :  WS
  |  ID
  ;

IF      : '#' NOISE* ('if' | 'endif' {$type=ENDIF;});
CRLF    : '\r'?  '\n';
WS      : (' ' | '\t' | '\f')+;
ID      : ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+;
COMMENT : '/*' .* '*/' {skip();};

fragment NOISE
  :  '/*' .* '*/'
  |  WS
  ;

fragment ENDIF : ;

which will parse the following input:

a b
# /* 
comment 
*/ if s
t
#    if a
u 
v
#      /*
another 
comment */  endif
#endif
c
d

and produce almost the same AST as the one I posted above.
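To see what the combined IF token in the lexer rule above accepts, here is a rough approximation using a Java regular expression. It is a sketch only, not what ANTLR generates: the class name DirectiveScanner and the method directives are hypothetical, and a regex is looser than the lexer rule (for example, the lazy comment match may span more text than a real comment would when that is the only way to reach a keyword).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Approximates: IF : '#' NOISE* ('if' | 'endif') with NOISE = comment or blanks.
class DirectiveScanner {
    // '#', then any mix of blanks and /* ... */ comments, then the keyword.
    // DOTALL lets '.' cross line breaks inside a comment.
    private static final Pattern DIRECTIVE = Pattern.compile(
        "#(?:[ \\t\\f]|/\\*.*?\\*/)*(endif|if)", Pattern.DOTALL);

    // Returns the keyword of every directive found, in source order.
    static List<String> directives(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = DIRECTIVE.matcher(source);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a b\n# /* \ncomment \n*/ if s\nt\n#    if a\nu \nv\n"
            + "#      /*\nanother \ncomment */  endif\n#endif\nc\nd\n";
        System.out.println(directives(input)); // [if, if, endif, endif]
    }
}
```

Folding the noise into the token is what keeps the parser rules simple: by the time the parser sees IF or ENDIF, the comments and blanks between # and the keyword are already gone.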

ANTLR grammar with ambiguous alternatives
