How do I match Markdown code blocks with a RegEx?

Time: 2016-12-27 20:36:52

Tags: regex markdown

I am trying to extract code blocks from a Markdown document using a PCRE RegEx. For the uninitiated, a code block in Markdown is defined as follows:

  

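To produce a code block in Markdown, simply indent every line of the block by at least 4 spaces or 1 tab. A code block continues until it reaches a line that is not indented (or the end of the article).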

So, given this text:

This is a code block:

    I need capturing along with
    this line

This is a code fence below (to be ignored):

``` json
This must have three backticks
flanking it
```

I love `inline code` too but don't capture

and one more short code block:

    Capture me

So far I have this RegEx:

(?:[ ]{4,}|\t{1,})(.+)

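But it only captures each individual line that is prefixed with at least four spaces or one tab. It does not capture the block as a whole.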

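What I need help with is how to set the condition so that everything after 4 spaces or 1 tab is captured until either a non-indented line or the end of the text is reached.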

Here is the work in progress online:

https://www.regex101.com/r/yMQCIG/5

3 Answers:

Answer 0: (score: 5)

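You should use the start-of-line and end-of-line anchors (^ and $, together with the m modifier). Also, your test text has only 3 leading spaces in the last block: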

^((?:(?:[ ]{4}|\t).*(\R|$))+)

Using \R together with the repetition makes each match cover a whole block rather than a single line.
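
For readers who want to try the same idea outside PCRE, here is a minimal sketch in Python. It is an adaptation, not the pattern above verbatim: Python's `re` module has no `\R`, so `(?:\n|\Z)` is assumed as a stand-in for `(\R|$)`, and the names `INDENTED_BLOCK` and `find_indented_blocks` are made up for illustration. The sample is the text from the question, with 4 leading spaces on the last block.

```python
import re

FENCE = "`" * 3  # built at runtime so the sample can contain a code fence

SAMPLE = "\n".join([
    "This is a code block:",
    "",
    "    I need capturing along with",
    "    this line",
    "",
    "This is a code fence below (to be ignored):",
    "",
    FENCE + " json",
    "This must have three backticks",
    "flanking it",
    FENCE,
    "",
    "I love `inline code` too but don't capture",
    "",
    "and one more short code block:",
    "",
    "    Capture me",
])

# Assumption: (?:\n|\Z) replaces PCRE's (\R|$), so only "\n" line endings
# are handled. re.MULTILINE makes ^ match at the start of every line.
INDENTED_BLOCK = re.compile(r"^(?:(?:[ ]{4}|\t).*(?:\n|\Z))+", re.MULTILINE)


def find_indented_blocks(text):
    """Return each run of consecutive indented lines as one string."""
    return [m.group(0) for m in INDENTED_BLOCK.finditer(text)]


blocks = find_indented_blocks(SAMPLE)
for block in blocks:
    print(repr(block))
```

Like the original pattern, this sketch knows nothing about context, so an indented line inside a code fence or a list would be reported as well.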

See the demo on regex101.

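Disclaimer: the rules of Markdown are more complex than the sample text suggests. For example, when code blocks sit inside (nested) lists, they need to be prefixed with 8, 12 or more spaces. A regular expression is not suited to identifying such code blocks, or others embedded in Markdown that combines a wider range of formatting.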

Answer 1: (score: 0)

Try this?

[a-z]*\n[\s\S]*?\n

It will extract the following from your example:

This must have three backticks
flanking it

Answer 2: (score: 0)

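There are 3 ways to mark up code: 1) indenting the start of each line, 2) enclosing a multi-line code block in 3 or more backticks, or 3) inline code. 1 and 3 are part of John Gruber's original Markdown specification.
Here is how to achieve this. You need to run 3 separate regexp tests: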

  1. Using indentation

     (?:\n{2,}|\A)                   # Starting at beginning of string or with 2 new lines
     (?<code_all>
         (?:
             (?<code_prefix>         # Lines must start with a tab or a tab-width of spaces
                 [ ]{4}
                 |
                 \t
             )
             (?<code_content>.*\n+)  # with some content, possibly nothing followed by a new line
         )+
     )
     (?<code_after>
         (?=^[ ]{0,4}\S)             # Lookahead for non-space at line-start
         |
         \Z                          # or end of doc
     )
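
As a rough illustration, the indentation pattern can be ported to Python, assuming the only change needed is the named-group syntax (`(?P<name>...)` instead of `(?<name>...)`) plus the `re.VERBOSE` and `re.MULTILINE` flags. The name `INDENTED_RE` and the sample text are made up for this sketch.

```python
import re

# Port of the indentation pattern; Python spells named groups (?P<name>...).
INDENTED_RE = re.compile(r"""
    (?:\n{2,}|\A)                       # start of string, or after a blank line
    (?P<code_all>
        (?:
            (?P<code_prefix>[ ]{4}|\t)  # four spaces or one tab
            (?P<code_content>.*\n+)     # rest of the line plus trailing newlines
        )+
    )
    (?P<code_after>
        (?=^[ ]{0,4}\S)                 # next line starts with non-space text
        |
        \Z                              # or end of document
    )
""", re.VERBOSE | re.MULTILINE)

text = "Intro:\n\n    first line\n    second line\n\nOutro\n"
match = INDENTED_RE.search(text)
code = match.group("code_all")
print(repr(code))
```

Two things to keep in mind: `code_all` includes the blank line that follows the block, because `\n+` is greedy, and every code line must end with a newline, so a final block without a trailing newline is not matched.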
    

2a) Using a code block with backticks (vanilla Markdown)

    (?:\n+|\A)?                         # Necessarily at the beginning of a new line or start of string
    (?<code_all>
        (?<code_start>
            [ ]{0,3}                    # Possibly up to 3 leading spaces
            \`{3,}                      # 3 code marks (backticks) or more
        )
        \n+
        (?<code_content>.*?)            # enclosed content
        \n+
        (?<!`)
        \g{code_start}                  # balanced closing block marks
        (?!`)
        [ \t]*                          # possibly followed by some space
        \n
    )
    (?<code_trailing_new_line>\n|\Z)    # and a new line or end of string

2b) Using a code block with backticks and a class specifier (extended Markdown)

    (?:\n+|\A)?                 # Necessarily at the beginning of a new line
    (?<code_all>
        (?<code_start>
            [ ]{0,3}            # Possibly up to 3 leading spaces
            \`{3,}              # 3 code marks (backticks) or more
        )
        [ \t]*                  # Possibly some spaces or tab
        (?:
            (?:
                (?<code_class>[\w\-\.]+)    # or a code class like html, ruby, perl
                (?:
                    [ \t]*
                    \{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
                )?                          # Possibly followed by class and id definition in curly braces
            )
            |
            (?:
                [ \t]*
                \{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
            )                           # Followed by class and id definition in curly braces
        )
        \n+
        (?<code_content>.*?)    # enclosed content
        \n+
        (?<!`)
        \g{code_start}          # balanced closing block marks
        (?!`)
    )
    (?:\n|\Z)                # and a new line or end of string
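
To make the fenced variants easier to experiment with, here is a simplified sketch in Python. It is not a faithful port: the `{...}` definition block is left out, the class name is treated as optional so that one pattern covers both 2a and 2b, and `(?P=code_start)` replaces `\g{code_start}` because Python's `re` uses that syntax for named backreferences. `FENCED_RE` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # built at runtime so the sample can contain a code fence

FENCED_RE = re.compile(r"""
    ^(?P<code_start>[ ]{0,3}`{3,})   # opening fence, up to 3 leading spaces
    [ \t]*
    (?P<code_class>[\w.-]*)          # optional class such as json or ruby
    [ \t]*\n
    (?P<code_content>.*?)            # enclosed content, as short as possible
    \n(?P=code_start)(?!`)           # the same fence again closes the block
    [ \t]*$
""", re.VERBOSE | re.MULTILINE | re.DOTALL)

text = (
    FENCE + " json\n"
    "This must have three backticks\n"
    "flanking it\n"
    + FENCE + "\n"
)

match = FENCED_RE.search(text)
print(match.group("code_class"))
print(match.group("code_content"))
```

`re.DOTALL` is what lets `.*?` span several lines; an empty fenced block (opening fence immediately followed by the closing one) is not handled by this sketch.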

  3. Using 1 or more backticks for inline code

     (?<!\\)                     # Ensuring this is not escaped
     (?<code_all>
         (?<code_start>\`{1,})   # One or more backtick(s)
         (?<code_content>.+?)    # Code content in between backticks
         (?<!`)                  # Not preceded by a backtick
         \g{code_start}          # Balanced closing backtick(s)
         (?!`)                   # And not followed by a backtick
     )
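
The inline pattern translates to Python almost directly. The sketch below assumes only the syntax changes already mentioned (`(?P<name>...)` for named groups and `(?P=name)` for the backreference); `INLINE_RE` and `inline_code` are names invented for the example.

```python
import re

INLINE_RE = re.compile(r"""
    (?<!\\)                  # the opening backtick must not be escaped
    (?P<code_start>`+)       # one or more backticks
    (?P<code_content>.+?)    # content, as short as possible
    (?<!`)                   # content must not end with a backtick
    (?P=code_start)          # the same number of closing backticks
    (?!`)                    # and no extra backtick after them
""", re.VERBOSE)


def inline_code(text):
    """Return the content of every inline code span found on a line."""
    return [m.group("code_content") for m in INLINE_RE.finditer(text)]


print(inline_code("I love `inline code` too but don't capture"))
print(inline_code("use ``a ` b`` here"))
```

Because `.` does not match a newline here, each span has to open and close on the same line, which is also what keeps the pattern from treating a code fence as inline code in the question's sample.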