Regex: parsing anchor characters in a regular expression parser

Asked: 2016-03-02 20:49:29

Tags: regex parsing syntax compiler-construction

I am trying to write a regular expression parser, and I can't figure out how to handle the end anchor ($) in a recursive descent parser.

Here is my BNF:

 * regex    = "^" regex
            | regex "$"
            | term ( "|" term)*
 * term     = concat concat*

 * concat   = element [*]
 *          | element [+]
 *          | element [?]
 *          | element "{" int* "}"
 *          | element "{" int* "," "}"
 *          | element "{" int* "," int* "}"
 * element  = "(" regex ")" | escaped_char | range | int | metacharacter | char
 * ranges   = "[" range* "]"
 * range    =  char "-" char         
 * metacharacter = ...
 * escaped_char = ...
 * int = 0 .. 9
 * char = ascii char

More specifically, how can I handle $ with only one token of lookahead?

The last concat node needs to be wrapped by the $ anchor. Can this be handled in a recursive descent parser, or do I need a different parsing algorithm?

This is what I have come up with so far (I can add comments if that would help clarify the question):

function parse_term token_list : (ast * token_list) = 
    next_token = lookahead token_list

    if next_token is '^'
        consume_tok ()
        anchor_group = parse_concat token_list

        if next_token is in follow_set(concat)
            consume_tok ()
            concat1 =  parse_concat token_list

            while next_token is in follow_set(concat)
                consume_tok ()
                concat1 = construct_concat (concat1, parse_concat token_list)

            return construct_concat (anchor_group, concat1)

        else 
            return anchor_group

    else if next_token is in follow_set(concat):

        concat1 =  parse_concat token_list

        while next_token is in follow_set(concat)
            consume_tok ()
            concat1 = construct_concat (concat1, parse_concat token_list)

        ------------------------------------------------
        here I need to handle the $ metacharacter for the last concat node
        but it has been handled in the while loop already.
        ----------------------------------------------------
        ....
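
One possible way past the point marked above, sketched in Python rather than in the pseudocode used here. It assumes the grammar is rewritten without left recursion as regex = ["^"] alternation ["$"], and that the term loop tests the lookahead against the tokens that can start a concat (its FIRST set) instead of its follow set. Since $ cannot start a concat, the loop stops in front of it and the caller can check for it with a single token of lookahead. All names (Parser, parse_regex, starts_concat, the tuple-shaped AST) are hypothetical, and ranges, counted repetition and escapes are left out to keep the sketch short.

```python
# Illustrative sketch only: names and AST shape are assumptions, not from the post.
class Parser:
    def __init__(self, pattern):
        self.tokens = list(pattern)  # one character per token
        self.pos = 0

    def peek(self):
        # Single token of lookahead; None marks end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def starts_concat(self, tok):
        # FIRST(concat): any token that can begin an element.
        return tok is not None and tok not in "|)$*+?^"

    def parse_regex(self):
        # regex = ["^"] alternation ["$"]
        anchored_start = self.peek() == "^"
        if anchored_start:
            self.consume()
        node = self.parse_alternation()
        # The term loop stopped before "$", so it is still the lookahead here.
        anchored_end = self.peek() == "$"
        if anchored_end:
            self.consume()
        if anchored_start:
            node = ("bol", node)
        if anchored_end:
            node = ("eol", node)
        return node

    def parse_alternation(self):
        node = self.parse_term()
        while self.peek() == "|":
            self.consume()
            node = ("alt", node, self.parse_term())
        return node

    def parse_term(self):
        # term = concat concat*; nothing is consumed here, parse_concat does that.
        node = self.parse_concat()
        while self.starts_concat(self.peek()):
            node = ("cat", node, self.parse_concat())
        return node

    def parse_concat(self):
        node = self.parse_element()
        if self.peek() is not None and self.peek() in "*+?":
            node = (self.consume(), node)
        return node

    def parse_element(self):
        tok = self.consume()
        if tok == "(":
            node = self.parse_regex()
            if self.consume() != ")":
                raise SyntaxError("expected ')'")
            return node
        if not self.starts_concat(tok):
            raise SyntaxError(f"unexpected token {tok!r}")
        return ("char", tok)


def parse(pattern):
    parser = Parser(pattern)
    ast = parser.parse_regex()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return ast
```

With this shape, parse("^ab$") wraps the whole concatenation in the anchor nodes, while something like parse("a$b") is rejected because input remains after the $. As in the BNF above, the anchor applies to the whole regex (or the whole group), not to a single alternative.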

0 answers:

No answers