Question

I am writing a PCRE regular expression whose purpose is to "minify" other PCRE regular expressions written in free-spacing and comments mode (the /x flag), for example:

# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d                # year (group 1)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|1[012])            # month (group 2)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|[12][0-9]|3[01])   # day (group 3)

Note: I have deliberately omitted any regex delimiters and the x flag.

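"Minifying" the expression above should remove all literal whitespace characters (including newlines) and all comments, except for literal spaces inside a character class (e.g. [- /.]) and escaped whitespace characters (e.g. "\ ", a backslash followed by a space):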

(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])

This is the regex I have so far, itself written in free-spacing and comments mode (https://regex101.com/r/RHnyWw/2/):

(?<!\\)\s          # Match any non-escaped whitespace character
|
(?<!\\)\#.*\s*$    # Match comments (any text following non-escaped #)

Assuming I replace every match with an empty string, the result is:

(19|20)\d\d[-/.](0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])

This is close, except that the literal space inside the separator parts of the pattern, [- /.], has been lost.
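The behaviour described above can be reproduced outside regex101 with a minimal Python sketch. It assumes that Python's re module treats this particular pattern the same way PCRE does (it only needs lookbehind, \s and $), and that the multiline flag was set in the original test, which the question does not state. The names NAIVE, SOURCE and naive_minify are made up for illustration.

```python
import re

# The pattern from the question, kept in free-spacing mode (re.X).
# re.M is an assumption: it lets `$` match at the end of every line.
NAIVE = re.compile(r"""
    (?<!\\)\s          # any non-escaped whitespace character
    |
    (?<!\\)\#.*\s*$    # comments (any text following a non-escaped #)
""", re.X | re.M)

# The free-spacing date pattern used as the example input.
SOURCE = r"""# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d                # year (group 1)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|1[012])            # month (group 2)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|[12][0-9]|3[01])   # day (group 3)
"""

def naive_minify(pattern_text: str) -> str:
    """Delete every match of NAIVE, including spaces inside [...]."""
    return NAIVE.sub("", pattern_text)

if __name__ == "__main__":
    # The separator classes come out as [-/.] because their space is deleted.
    print(naive_minify(SOURCE))
```

The lookbehind only protects whitespace that directly follows a backslash, so nothing stops the first alternative from matching the space inside [- /.].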

How can I change this pattern so that literal space (and #) characters between [ and ] are preserved?

Answer 1

Maybe this regex can help:

(?:\[(?:[^\\\]]++|\\.)*+\]|\\.)(*SKIP)(*F)|\#.*?$|\s++
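The pattern above first matches whole character classes and escaped characters and then discards those matches with (*SKIP)(*F), so that only comments and bare whitespace are left to be replaced. Python's standard re module has no backtracking control verbs, so the sketch below swaps in a different technique that gives the same result on the example: the parts to keep are captured in group 1 and put back by a replacement callback. This is an approximation under those assumptions, not a port of the PCRE pattern; MINIFY, SOURCE and minify are hypothetical names.

```python
import re

# Group 1 captures what must survive untouched; the remaining
# alternatives match what should be deleted.
MINIFY = re.compile(r"""
    ( \[ (?: [^\\\]] | \\. )* \]   # a whole character class, e.g. [- /.]
    | \\.                          # any escaped character, e.g. \d
    )
    | \#.*?$                       # a comment, up to the end of the line
    | \s+                          # a run of literal whitespace
""", re.X | re.M)

# The free-spacing date pattern from the question.
SOURCE = r"""# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d                # year (group 1)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|1[012])            # month (group 2)
[- /.]                     # separator - dash, space, slash or period
(0[1-9]|[12][0-9]|3[01])   # day (group 3)
"""

def minify(pattern_text: str) -> str:
    """Keep character classes and escapes, drop comments and whitespace."""
    # group(1) is None for comment and whitespace matches, so they vanish.
    return MINIFY.sub(lambda m: m.group(1) or "", pattern_text)

if __name__ == "__main__":
    print(minify(SOURCE))
```

Because the character-class alternative is tried first at every position, a space or # inside [...] is consumed as part of the class before the comment or whitespace alternatives get a chance to match it. Classes whose first member is a literal ], such as []abc], are not handled by this sketch.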

Answer 2

Here is my solution:

# Match any literal whitespace character, except when within a valid character class
# at first position, or second position after `-`
(?<!\\|(?<!\\)\[|(?<!\\)\[-)\s 
|
# Match comments (any text following a literal # until end-of-line), except when
# within a character class at first position, or second position after `-` or third
# position after `- `
(?<!\\|(?<!\\)\[|(?<!\\)\[-|(?<!\\)\[\ |(?<!\\)\[-\ )\#.*$\r?\n?

The minified result is:

(?<!\\|(?<!\\)\[|(?<!\\)\[-)\s|(?<!\\|(?<!\\)\[|(?<!\\)\[-|(?<!\\)\[\ |(?<!\\)\[-\ )\#.*$\r?\n?

https://regex101.com/r/3EVpuH/1

One advantage of this solution is that it does not rely on backtracking control verbs (which I did not know about until I saw Michail's solution).

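The drawback (compared with Michail's solution) is that if you want to include dash, space and/or # characters in a character class, they must appear in a specific order: dash, then space, then hash. I don't know whether this requirement can be removed without using control verbs.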

Regex for removing whitespace and comments from a free-spacing and comments mode regex
