Question

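(Before I start: I'm doing this in Python.)

Basically, I need a single regex that matches every quotation mark directly before or after my HTML QUOT tags: if a quote exists in one of those positions, it has to match.

Example: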

<QUOT.START> Hello, this doesn't match! <\QUOT.END> 

"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "

I currently have 4 different regexes for this:

1.   \"+(?=<QUOT\.START>)

2.   (?<=<QUOT\.START>)\"+

3.   \"+(?=<\\QUOT\.END>)

4.   (?<=<\\QUOT\.END>)\"+

Can I combine these 4 into one?

Answer 1

If you are able to use the newer regex module (which supports variable-length lookbehinds), you can condense the expression to

(?<=<\\?QUOT\.(?:START|END)>[\t ]*)" # a quote after <QUOT.START> or <\QUOT.END>,
                                     # with optional spaces or tabs in between
|
"(?=[\t ]*<\\?QUOT\.(?:START|END)>)  # a quote before <QUOT.START> or <\QUOT.END>,
                                     # with optional spaces or tabs in between

Without verbose mode:

(?<=<\\?QUOT\.(?:START|END)>[\t ]*)"|"(?=[\t ]*<\\?QUOT\.(?:START|END)>)

In general terms, this is:

(?<=<tag><optional whitespace>)quote|quote(?=<optional whitespace><tag>)

In Python:

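import regex as re  # third-party module: pip install regex

# Raw string so that the backslash in <\QUOT.END> is kept literally.
string = r"""
<QUOT.START> Hello, this doesn't match! <\QUOT.END> 
"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "
"""

rx = re.compile(r'''(?<=<\\?QUOT\.(?:START|END)>[\t ]*)"|"(?=[\t ]*<\\?QUOT\.(?:START|END)>)''')

for m in rx.finditer(string):
    print(m.group(0))  # the matched quote
    print(m.span())    # its (start, end) position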

This yields the four quotes and their positions.
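
If installing the regex module is not an option, something similar can be approximated with the standard re module, which only allows fixed-width lookbehinds. The following is a minimal sketch, not part of the answer above: it consumes the tag and the optional whitespace instead of looking behind, and captures the quote in a group. The names TAG, PATTERN and quote_positions are made up for illustration.

```python
import re

# Hypothetical helper names; only the tag sub-pattern comes from the answer above.
TAG = r'<\\?QUOT\.(?:START|END)>'

PATTERN = re.compile(
    rf'{TAG}[\t ]*(")'        # group 1: a quote after a tag (tag is consumed, not looked behind)
    rf'|(")(?=[\t ]*{TAG})'   # group 2: a quote before a tag (plain lookahead)
)

def quote_positions(text):
    """Return the index of every quote that sits next to a QUOT tag."""
    # m.lastindex is 1 or 2, whichever group captured the quote.
    return [m.start(m.lastindex) for m in PATTERN.finditer(text)]

text = r'''
<QUOT.START> Hello, this doesn't match! <\QUOT.END> 
"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "
'''

positions = quote_positions(text)
print(positions)
```

Because the tag is consumed here, the match itself is longer than the quote, so read the quote from the captured group rather than from m.group(0).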

Answer 2

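@ctwheels helped me figure out this (super simple) solution: being new to regex, I didn't know about the | (pipe) alternation syntax. So here is the final regex I wanted (and it works!)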

\"+(?=<QUOT\.START>)|(?<=<QUOT\.START>)\"+|\"+(?=<\\QUOT\.END>)|(?<=<\\QUOT\.END>)\"+
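
For anyone who wants to try it, here is a small sketch of that alternation with the standard re module. The sample strings and the names COMBINED, adjacent and spaced are only illustrative. Note that these four patterns need the quote to touch the tag directly, so the spaced example from the question matches only its first quote; that is the gap the optional whitespace in Answer 1 covers.

```python
import re

# The four original patterns joined with | (alternation).
COMBINED = re.compile(
    r'"+(?=<QUOT\.START>)'      # quotes directly before <QUOT.START>
    r'|(?<=<QUOT\.START>)"+'    # quotes directly after <QUOT.START>
    r'|"+(?=<\\QUOT\.END>)'     # quotes directly before <\QUOT.END>
    r'|(?<=<\\QUOT\.END>)"+'    # quotes directly after <\QUOT.END>
)

# Illustrative input: no spaces between the quotes and the tags.
adjacent = r'"<QUOT.START>"Hello"<\QUOT.END>"'
adjacent_spans = [m.span() for m in COMBINED.finditer(adjacent)]
print(adjacent_spans)

# The example from the question, where spaces separate most quotes from the tags.
spaced = r'"<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> "'
spaced_spans = [m.span() for m in COMBINED.finditer(spaced)]
print(spaced_spans)
```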

Answer 3

You can try this:

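import re

# Raw string so that the backslash in <\QUOT.END> is kept literally.
s = r'<QUOT.START> "Hello, this will call 4 matches! " <\QUOT.END> '
# Capture the text between a pair of double quotes.
strings = re.findall(r'"(.*?)"', s)
print(strings)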

Output:

['Hello, this will call 4 matches! ']
