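Regex for managing escape characters for items like string literals

Date: 2009-01-10 08:45:10

Tags: python regex

I would like to be able to match a string literal that may contain escaped quotes. For instance, I'd like to search "this is a 'test with escaped\' values' ok" and have the backslash correctly recognized as an escape character. I've tried solutions like the following: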

import re
regexc = re.compile(r"\'(.*?)(?<!\\)\'")
match = regexc.search(r""" Example: 'Foo \' Bar'  End. """)
print(match.groups())
# I want ("Foo \' Bar") to be printed above

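Looking at this afterwards, there is a simple problem: the escape character being used, "\", cannot itself be escaped. I can't figure out how to do that. I wanted a solution like the following, but negative lookbehind assertions need to be fixed length: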

# ...
re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
# ...
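To make the two problems concrete, here is a small sketch (not part of the original question; the sample strings are made up for illustration). It shows the first attempt over-running when the quoted string ends in an escaped backslash, and Python's re module rejecting the variable-length lookbehind of the second attempt:

```python
import re

# First attempt from the question: lazy body, and a closing quote that
# must not be preceded by a backslash.
first_try = re.compile(r"\'(.*?)(?<!\\)\'")

# Fine while the only escape in play is an escaped quote.
print(first_try.search(r"'Foo \' Bar' End.").group(1))

# Over-runs when the string ends in an escaped backslash: the real closing
# quote is rejected by the lookbehind, so the match extends to the next quote.
overrun = first_try.search(r"'Foo \\' End 'x'").group(1)
print(overrun)

# Second attempt: the lookbehind contains a starred group, so its width is
# not fixed and re.compile() raises re.error.
try:
    re.compile(r"\'(.*?)(?<!\\(\\\\)*)\'")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)
```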

Are there any regex gurus able to tackle this problem? Thanks.

6 answers:

Answer 0 (score: 16)

re_single_quote = r"'[^'\\]*(?:\\.[^'\\]*)*'"

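First, note that MizardX's answer is 100% accurate. I'd just like to add some recommendations regarding efficiency. Second, this problem was solved and optimized long ago. See: Mastering Regular Expressions (3rd Edition), which covers this specific problem in great detail and is highly recommended.

Let's start with the sub-expression that matches a single-quoted string which may contain escaped single quotes. If you are going to allow escaped single quotes, you had better at least allow escaped escapes as well (which is what Douglas Leeder's answer does). But as long as you're at it, it is just as easy to allow escaped anything-else. With these requirements, MizardX is the only one who got the expression right. Here it is in both short and long form (I've taken the liberty of writing the long form in VERBOSE mode with plenty of descriptive comments, which you should always do for non-trivial regexes):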

# MizardX's correct regex to match single quoted string:
re_sq_short = r"'((?:\\.|[^\\'])*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # Capture group $1: Contents.
      (?:       # Group for contents alternatives
        \\.     # Either escaped anything
      | [^\\']  # or one non-quote, non-escape.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

This works for all of the following test strings:

test01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"

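OK, now let's begin to improve on this. First, the order of the alternatives makes a difference, and the most likely alternative should always come first. Here, non-escaped characters are more likely than escaped ones, so reversing the order improves the regex's efficiency slightly: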

# Better regex to match single quoted string:
re_sq_short = r"'((?:[^\\']|\\.)*)'"
re_sq_long = r"""
    '           # Literal opening quote
    (           # $1: Contents.
      (?:       # Group for contents alternatives
        [^\\']  # Either a non-quote, non-escape,
      | \\.     # or an escaped anything.
      )*        # Zero or more contents alternatives.
    )           # End $1: Contents.
    '
    """

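"Unrolling-the-Loop":

This is a little better, but it can be improved further (significantly) by applying Jeffrey Friedl's "unrolling-the-loop" efficiency technique (from MRE3). The regex above is not optimal because it must painstakingly apply the star quantifier to a non-capturing group of two alternatives, each of which consumes only one or two characters at a time. The alternation can be eliminated entirely by recognizing that a similar pattern is repeated over and over, so an equivalent expression can be crafted that does the same job without alternation. Here is an optimized expression that matches a single-quoted string and captures its contents into group $1: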

# Better regex to match single quoted string:
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"
re_sq_long = r"""
    '            # Literal opening quote
    (            # $1: Contents.
      [^'\\]*    # {normal*} Zero or more non-', non-escapes.
      (?:        # Group for {(special normal*)*} construct.
        \\.      # {special} Escaped anything.
        [^'\\]*  # More {normal*}.
      )*         # Finish up {(special normal*)*} construct.
    )            # End $1: Contents.
    '
    """

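This expression gobbles up all non-quote, non-backslash characters (the vast majority of most strings) in one "gulp", which drastically reduces the amount of work the regex engine must perform. How much better, you ask? Well, I entered each of the regexes presented in this question into RegexBuddy and measured how many steps the regex engine took to complete a match on the following string (which all solutions match correctly):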

'This is an example string which contains one \'internally quoted\' string.'

Here are the benchmark results for the test string above:

r"""
AUTHOR            SINGLE-QUOTE REGEX   STEPS TO: MATCH  NON-MATCH
Evan Fosmark      '(.*?)(?<!\\)'                  374     376
Douglas Leeder    '(([^\\']|\\'|\\\\)*)'          154     444
cletus/PEZ        '((?:\\'|[^'])*)(?<!\\)'        223     527
MizardX           '((?:\\.|[^\\'])*)'             221     369
MizardX(improved) '((?:[^\\']|\\.)*)'             153     369
Jeffrey Friedl    '([^\\']*(?:\\.[^\\']*)*)'       13      19
"""

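The step counts are the number of steps needed to match the test string, as reported by the RegexBuddy debugger function. The "NON-MATCH" column is the number of steps needed to declare match failure when the closing quote is removed from the test string. As you can see, the difference is significant in both the matching and non-matching cases. Note also that these efficiency improvements apply only to NFA engines that use backtracking (i.e. Perl, PHP, Java, Python, Javascript, .NET, Ruby and most others). A DFA engine will not see any performance boost from this technique (see: Regular Expression Matching Can Be Simple And Fast).

On to the complete solution:

The goal of the original question (in my interpretation) is to pick out single-quoted sub-strings (which may contain escaped quotes) from a larger string. If it is known that the text outside the quoted sub-strings will never contain escaped single quotes, the regex above will do the job. However, the sub-strings may sit in a sea of text swimming with escaped quotes, escaped escapes and escaped anything-elses (which is my interpretation of what the author is after). I originally thought that matching them correctly required parsing from the beginning of the string, but it does not: it can be achieved with MizardX's very clever (?<!\\)(?:\\\\)* expression. Here are some test strings to exercise the various solutions: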
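For readers without RegexBuddy, a rough wall-clock comparison can be sketched with timeit. Python's re engine does not expose step counts, so these timings are not comparable to the table above, and the absolute numbers depend on the machine and Python version; the repetition count is an arbitrary choice.

```python
import re
import timeit

subject = r"'This is an example string which contains one \'internally quoted\' string.'"
no_close = subject[:-1]  # same string with the closing quote removed

patterns = {
    "alternation": r"'((?:[^\\']|\\.)*)'",
    "unrolled": r"'([^'\\]*(?:\\.[^'\\]*)*)'",
}

timings = {}
for name, source in patterns.items():
    rx = re.compile(source)
    assert rx.search(subject) is not None   # the full string matches
    assert rx.search(no_close) is None      # without the closing quote it cannot
    for label, text in (("match", subject), ("non-match", no_close)):
        seconds = timeit.timeit(lambda: rx.search(text), number=20000)
        timings[(name, label)] = seconds
        print("%-12s %-10s %.4f s" % (name, label, seconds))
```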

test01 = r"out1 'escaped-escape:        \\ ' out2"
test02 = r"out1 'escaped-quote:         \' ' out2"
test03 = r"out1 'escaped-anything:      \X ' out2"
test04 = r"out1 'two escaped escapes: \\\\ ' out2"
test05 = r"out1 'escaped-quote at end:   \'' out2"
test06 = r"out1 'escaped-escape at end:  \\' out2"
test07 = r"out1           'str1' out2 'str2' out2"
test08 = r"out1 \'        'str1' out2 'str2' out2"
test09 = r"out1 \\\'      'str1' out2 'str2' out2"
test10 = r"out1 \\        'str1' out2 'str2' out2"
test11 = r"out1 \\\\      'str1' out2 'str2' out2"
test12 = r"out1         \\'str1' out2 'str2' out2"
test13 = r"out1       \\\\'str1' out2 'str2' out2"
test14 = r"out1           'str1''str2''str3' out2"

Given this test data, let's see how the various solutions fare ('p' == pass, 'XX' == fail):

r"""
AUTHOR/REGEX     01  02  03  04  05  06  07  08  09  10  11  12  13  14
Douglas Leeder    p   p  XX   p   p   p   p   p   p   p   p  XX  XX  XX
  r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'"
cletus/PEZ        p   p   p   p   p  XX   p   p   p   p   p  XX  XX  XX
  r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'"
MizardX           p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'"
ridgerunner       p   p   p   p   p   p   p   p   p   p   p   p   p   p
  r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'"
"""

And a working test script:

import re
data_list = [
    r"out1 'escaped-escape:        \\ ' out2",
    r"out1 'escaped-quote:         \' ' out2",
    r"out1 'escaped-anything:      \X ' out2",
    r"out1 'two escaped escapes: \\\\ ' out2",
    r"out1 'escaped-quote at end:   \'' out2",
    r"out1 'escaped-escape at end:  \\' out2",
    r"out1           'str1' out2 'str2' out2",
    r"out1 \'        'str1' out2 'str2' out2",
    r"out1 \\\'      'str1' out2 'str2' out2",
    r"out1 \\        'str1' out2 'str2' out2",
    r"out1 \\\\      'str1' out2 'str2' out2",
    r"out1         \\'str1' out2 'str2' out2",
    r"out1       \\\\'str1' out2 'str2' out2",
    r"out1           'str1''str2''str3' out2",
    ]

regex = re.compile(
    r"""(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'""",
    re.DOTALL)

data_cnt = 0
for data in data_list:
    data_cnt += 1
    print ("\nData string %d" % (data_cnt))
    m_cnt = 0
    for match in regex.finditer(data):
        m_cnt += 1
        if (match.group(1)):
            print("  quoted sub-string%3d = \"%s\"" %
                (m_cnt, match.group(1)))

Phew!

P.S. Thanks to MizardX for the very cool (?<!\\)(?:\\\\)* expression. Learn something new every day!

Answer 1 (score: 5)

I think this will work:

import re
regexc = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")

def check(test, base, target):
    match = regexc.search(base)
    assert match is not None, test+": regex didn't match for "+base
    assert match.group(1) == target, test+": "+target+" not found in "+base
    print("test %s passed" % test)

check("Empty","''","")
check("single escape1", r""" Example: 'Foo \' Bar'  End. """,r"Foo \' Bar")
check("single escape2", r"""'\''""",r"\'")
check("double escape",r""" Example2: 'Foo \\' End. """,r"Foo \\")
check("First quote escaped",r"not matched\''a'","a")
check("First quote escaped beginning",r"\''a'","a")

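The regular expression r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'" matches forward only, accepting just the things we want inside the string:

  1. Characters that are not a backslash or a quote.
  2. Escaped quote
  3. Escaped backslash

Edit:

Added an extra sub-expression at the front to check for an escaped first quote.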

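Answer 2 (score: 3)

Douglas Leeder's pattern ((?:^|[^\\])'(([^\\']|\\'|\\\\)*)') will fail to match "test 'test \x3F test' test" and "test \\'test' test". (A string containing an escape other than quote or backslash, and a string preceded by an escaped backslash.)

cletus' pattern ((?<!\\)'((?:\\'|[^'])*)(?<!\\)') will fail to match "test 'test\\' test". (A string ending with an escaped backslash.)

My proposal for single-quoted strings is this: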

(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'
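The claims above can be checked with a short sketch. The three strings are taken as raw text, that is, each backslash shown is a literal backslash; the helper name captured is made up for this example.

```python
import re

leeder = re.compile(r"(?:^|[^\\])'(([^\\']|\\'|\\\\)*)'")
cletus = re.compile(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'")
mizardx = re.compile(r"(?<!\\)(?:\\\\)*'((?:\\.|[^\\'])*)'")

cases = [
    r"test 'test \x3F test' test",   # escape other than quote or backslash
    r"test \\'test' test",           # preceded by an escaped backslash
    r"test 'test\\' test",           # ends with an escaped backslash
]

def captured(rx, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = rx.search(text)
    return m.group(1) if m else None

for text in cases:
    print(text)
    for name, rx in (("leeder", leeder), ("cletus", cletus), ("mizardx", mizardx)):
        print("    %-8s %r" % (name, captured(rx, text)))
```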

For strings that may be either single-quoted or double-quoted, you could use this:

(?<!\\)(?:\\\\)*("|')((?:\\.|(?!\1)[^\\])*)\1

Test run using Python:

Douglas Leeder's test cases:
"''" matched successfully: ""
" Example: 'Foo \' Bar'  End. " matched successfully: "Foo \' Bar"
"'\''" matched successfully: "\'"
" Example2: 'Foo \\' End. " matched successfully: "Foo \\"
"not matched\''a'" matched successfully: "a"
"\''a'" matched successfully: "a"

cletus' test cases:
"'testing 123'" matched successfully: "testing 123"
"'testing 123\\'" matched successfully: "testing 123\\"
"'testing 123" didn't match, as expected.
"blah 'testing 123" didn't match, as expected.
"blah 'testing 123'" matched successfully: "testing 123"
"blah 'testing 123' foo" matched successfully: "testing 123"
"this 'is a \' test'" matched successfully: "is a \' test"
"another \' test 'testing \' 123' \' blah" matched successfully: "testing \' 123"

MizardX's test cases:
"test 'test \x3F test' test" matched successfully: "test \x3F test"
"test \\'test' test" matched successfully: "test"
"test 'test\\' test" matched successfully: "test\\"

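Answer 3 (score: 1)

If I understand what you're saying (and I'm not sure I do), you want to find the quoted string within your string while ignoring escaped quotes. Is that right? If so, try this: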

/(?<!\\)'((?:\\'|[^'])*)(?<!\\)'/

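Basically:

  • Start with a single quote that isn't preceded by a backslash;
  • Match zero or more occurrences of: a backslash followed by a quote, or any character other than a quote;
  • End with a quote;
  • Don't capture the inner group (the ?: operator); and
  • The closing quote can't be preceded by a backslash.

OK, I've tested this in Java (sorry, that's more my schtick than Python, but the principle is the same):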

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The wrapper class name is arbitrary; it only makes the snippet compilable.
public class QuoteRegexTest {
    private static final String[] TESTS = {
            "'testing 123'",
            "'testing 123\\'",
            "'testing 123",
            "blah 'testing 123",
            "blah 'testing 123'",
            "blah 'testing 123' foo",
            "this 'is a \\' test'",
            "another \\' test 'testing \\' 123' \\' blah"
    };

    public static void main(String[] args) {
        // Same regex as above; backslashes are doubled for the Java string literal.
        Pattern p = Pattern.compile("(?<!\\\\)'((?:\\\\'|[^'])*)(?<!\\\\)'");
        for (String test : TESTS) {
            Matcher m = p.matcher(test);
            if (m.find()) {
                System.out.printf("%s => %s%n", test, m.group(1));
            } else {
                System.out.printf("%s doesn't match%n", test);
            }
        }
    }
}

Output:

'testing 123' => testing 123
'testing 123\' doesn't match
'testing 123 doesn't match
blah 'testing 123 doesn't match
blah 'testing 123' => testing 123
blah 'testing 123' foo => testing 123
this 'is a \' test' => is a \' test
another \' test 'testing \' 123' \' blah => testing \' 123

Which appears to be correct.

Answer 4 (score: 0)

Using cletus' expression with Python's re.findall():
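For comparison, the same check can be sketched in Python. The strings below are the Java literals written as Python raw strings, so each Java "\\" becomes a single backslash; the results dictionary is just a convenience for this example.

```python
import re

pattern = re.compile(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'")

TESTS = [
    r"'testing 123'",
    r"'testing 123\'",
    r"'testing 123",
    r"blah 'testing 123",
    r"blah 'testing 123'",
    r"blah 'testing 123' foo",
    r"this 'is a \' test'",
    r"another \' test 'testing \' 123' \' blah",
]

results = {}
for test in TESTS:
    m = pattern.search(test)
    results[test] = m.group(1) if m else None
    if m:
        print("%s => %s" % (test, m.group(1)))
    else:
        print("%s doesn't match" % test)
```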

re.findall(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'", s)

A test that finds several matches in one string:

>>> re.findall(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'",
 r"\''foo bar gazonk' foo 'bar' gazonk 'foo \'bar\' gazonk' 'gazonk bar foo\'")
['foo bar gazonk', 'bar', "foo \\'bar\\' gazonk"]
>>>

Using cletus' TESTS array of strings:

["%s => %s" % (s, re.findall(r"(?<!\\)'((?:\\'|[^'])*)(?<!\\)'", s)) for s in TESTS]

Works like a charm. (Test it yourself or take my word for it.)

Answer 5 (score: 0)

>>> print(re.findall(r"('([^'\\]|\\'|\\\\)*')",r""" Example: 'Foo \' Bar'  End. """)[0][0])

'Foo \' Bar'