Question

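I am trying to clean mathematical formulas out of some text. I am using a regex to replace them, but it does not seem to work for formulas that span multiple lines.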

For example, I have this (dirty) string:

tst = r"""
The appropriate tool to deal in a statistical mechanics framework with a
system of quantum numbers latexmath:[$\vec X=(N,S,Q,C,B)$] is the
canonical partition function (*???*; *???*)
latexmath:[\[Z_{N,S,Q,C,B} = \frac {1}{(2\pi)^5}\int\limits^{\pi}_{-\pi}
d^5\vec\phi\;e^{i\vec\phi\vec X} \exp{(\sum_j z_j)},
\label{eq:partition1}\]] where
latexmath:[\[z_j= g_j\frac{V}{(2\pi)^3}\int d^3p\;\ln(1\pm
\exp{(-\sqrt{p^2+m_j^2}/T -i\vec x_j\vec \phi)})^{\pm 1},
\label{eq:partition1b}\]] 
"""

This is the regex I use to match the formulas:

regex = r"(fm|cm)?latexmath:\[.+?\]\]?"

I also tested it on regex101, and it seems to work there.

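I use re.DOTALL so that the pattern can match across lines. The substitution replaces formulas that fit on a single line, but not the ones that span two or more lines. However, when I use finditer, it does find the multi-line expressions as well.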

Here are the results with re.finditer. The re.DOTALL flag works here:

In [48]: for match in re.finditer(regex, tst):
    ...:     print(repr(match.group(0)))
    ...:     
'latexmath:[$\\vec X=(N,S,Q,C,B)$]'

In [49]: for match in re.finditer(regex, tst, re.DOTALL):
    ...:     print(repr(match.group(0)))
    ...:     
    ...:     
'latexmath:[$\\vec X=(N,S,Q,C,B)$]'
'latexmath:[\\[Z_{N,S,Q,C,B} = \\frac {1}{(2\\pi)^5}\\int\\limits^{\\pi}_{-\\pi}\nd^5\\vec\\phi\\;e^{i\\vec\\phi\\vec X} \\exp{(\\sum_j z_j)},\n\\label{eq:partition1}\\]]'
'latexmath:[\\[z_j= g_j\\frac{V}{(2\\pi)^3}\\int d^3p\\;\\ln(1\\pm\n\\exp{(-\\sqrt{p^2+m_j^2}/T -i\\vec x_j\\vec \\phi)})^{\\pm 1},\n\\label{eq:partition1b}\\]]'

However, the same regex does not work with re.sub (it replaces only the first occurrence):

In [50]: re.sub(regex, 'XXXX' , tst, re.DOTALL)
Out[50]: '\nThe appropriate tool to deal in a statistical mechanics framework with a\nsystem of quantum numbers XXXX is the\ncanonical partition function (*???*; *???*)\nlatexmath:[\\[Z_{N,S,Q,C,B} = \\frac {1}{(2\\pi)^5}\\int\\limits^{\\pi}_{-\\pi}\nd^5\\vec\\phi\\;e^{i\\vec\\phi\\vec X} \\exp{(\\sum_j z_j)},\n\\label{eq:partition1}\\]] where\nlatexmath:[\\[z_j= g_j\\frac{V}{(2\\pi)^3}\\int d^3p\\;\\ln(1\\pm\n\\exp{(-\\sqrt{p^2+m_j^2}/T -i\\vec x_j\\vec \\phi)})^{\\pm 1},\n\\label{eq:partition1b}\\]] '

In [51]: re.sub(regex, 'XXXX' , tst)
Out[51]: '\nThe appropriate tool to deal in a statistical mechanics framework with a\nsystem of quantum numbers XXXX is the\ncanonical partition function (*???*; *???*)\nlatexmath:[\\[Z_{N,S,Q,C,B} = \\frac {1}{(2\\pi)^5}\\int\\limits^{\\pi}_{-\\pi}\nd^5\\vec\\phi\\;e^{i\\vec\\phi\\vec X} \\exp{(\\sum_j z_j)},\n\\label{eq:partition1}\\]] where\nlatexmath:[\\[z_j= g_j\\frac{V}{(2\\pi)^3}\\int d^3p\\;\\ln(1\\pm\n\\exp{(-\\sqrt{p^2+m_j^2}/T -i\\vec x_j\\vec \\phi)})^{\\pm 1},\n\\label{eq:partition1b}\\]] '

I am not sure why re.sub does not handle the multi-line expressions here.

Answer 1

Try

(?s)(?:fm|cm)?latexmath:\[.+?\]\]?

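This simply puts re.DOTALL into the regex itself. From the re documentation:

(?iLmsux)
  (One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.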
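As a minimal sketch of the inline-flag approach, the snippet below applies the pattern to a short made-up string rather than the full text from the question. The names `text`, `pattern` and `cleaned` are only for illustration.

```python
import re

# Hypothetical sample: one single-line formula and one formula spanning two lines.
text = (
    "inline latexmath:[$a+b$] and display\n"
    "latexmath:[\\[x =\n"
    "y\\]] done"
)

# (?s) turns on DOTALL for the whole pattern, so no flags argument is needed.
# Keep it at the very start: recent Python 3 releases reject global inline
# flags that appear anywhere else in the pattern.
pattern = r"(?s)(?:fm|cm)?latexmath:\[.+?\]\]?"

cleaned = re.sub(pattern, "XXXX", text)
print(cleaned)
```

Both formulas are replaced, including the one that contains a newline.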

or

print(re.sub(regex, 'XXXX', tst, flags=re.DOTALL))

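See the definition def sub(pattern, repl, string, count=0, flags=0): together with S = DOTALL = sre_compile.SRE_FLAG_DOTALL and SRE_FLAG_DOTALL = 16. The fourth positional parameter is count, not flags, so re.sub(regex, 'XXXX', tst, re.DOTALL) asks for at most 16 replacements and leaves DOTALL off. Without DOTALL the dot cannot cross a newline, which is why only the single-line formula is replaced.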
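To see what those definitions imply in practice, here is a small sketch on a made-up string of twenty letters (`letters` and `capped` are illustrative names). It passes re.DOTALL as count by keyword, which has the same effect as passing it as the fourth positional argument.

```python
import re

# The flag is just an integer underneath.
dotall_value = int(re.DOTALL)
print(dotall_value)

# Hypothetical input with more possible matches than the flag's numeric value.
letters = "a" * 20

# Using the flag where count is expected caps the number of replacements
# instead of changing how the dot matches.
capped = re.sub("a", "X", letters, count=re.DOTALL)
print(capped)
```

Only the first 16 letters are replaced, which shows that the flag was read as a replacement limit.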

Answer 2

Try this:

re.sub(regex, 'XXXX' , tst, flags=re.DOTALL)

The signature is

re.sub(pattern, repl, string, count=0, flags=0)

so

re.sub(regex, 'XXXX' , tst, re.DOTALL)

is equivalent to re.sub(regex, 'XXXX', tst, count=re.DOTALL), so it does not work as intended.

For more details on re.sub, see https://docs.python.org/2/library/re.html#re.sub
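One way to sidestep the argument-order pitfall altogether, not mentioned in the question, is to compile the pattern once with its flags. The sub method of a compiled pattern takes no flags parameter, so the two cannot be confused. This is only a sketch: `FORMULA_RE`, `strip_formulas` and `sample` are hypothetical names.

```python
import re

# The second parameter of re.compile is flags, so DOTALL is applied here.
FORMULA_RE = re.compile(r"(?:fm|cm)?latexmath:\[.+?\]\]?", re.DOTALL)


def strip_formulas(text, placeholder="XXXX"):
    """Replace every latexmath formula in text with the placeholder."""
    return FORMULA_RE.sub(placeholder, text)


# Hypothetical sample: a two-line display formula and a prefixed inline one.
sample = "see latexmath:[\\[a =\nb\\]] and cmlatexmath:[$c$] here"
result = strip_formulas(sample)
print(result)
```

The flags live with the pattern, so every later call to strip_formulas behaves the same way.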

Python regex: replacing multi-line matches does not work
