Question

I've run into a regular expression problem in Python (2.7.9).

I'm trying to strip HTML <span> tags using a regex like this:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)

(The regex reads: <span, anything that isn't >, then a >, then a non-greedy match of anything, followed by </span>; re.S (re.DOTALL) is there so that . matches newlines.)

This seems to work unless the text contains a newline. It looks like re.S (DOTALL) doesn't apply within a non-greedy match.

Here is the test code. Remove the newline from text1 and re.sub works. Put it back and re.sub fails. Put the newline character outside the <span> tag and re.sub works again.

#!/usr/bin/env python
import re
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
print repr(text1)
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
print repr(text2)

For comparison, I wrote a Perl script that does the same thing; there the regex works as I expect.

#!/usr/bin/perl
$text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>';
print "$text1\n";
$text1 =~ s/<span[^>]*>(.*?)<\/span>/\1/s;
print "$text1\n";

Any ideas?

Tested in Python 2.6.6 and Python 2.7.9.

Answer 1

The fourth argument of re.sub is count, not flags.

re.sub(pattern, repl, string, count=0, flags=0)

You need to pass flags explicitly as a keyword argument:

re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S)
                                                      ↑↑↑↑↑↑

Otherwise, re.S is interpreted as the replacement count (at most 16 replacements) rather than as the S (DOTALL) flag:

>>> import re
>>> re.S
16
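To make the "at most 16 replacements" reading concrete, here is a small sketch. The input string `sample` is hypothetical, chosen only so that the cap is visible; passing `count=re.S` spells out what the positional call in the question actually means.

```python
import re

# Hypothetical input: twenty 'a' characters, more than the cap of 16.
sample = 'a' * 20

# Passing re.S in the fourth position is the same as count=re.S,
# i.e. count=16, so only the first 16 matches are replaced.
capped = re.sub(r'a', 'b', sample, count=re.S)
print(capped)
```

No DOTALL behaviour is enabled by this call; the value 16 only limits how many substitutions are made.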

>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
'<body id="aa">this is a <span color="red">test\n with newline</span></body>'

>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S)
'<body id="aa">this is a test\n with newline</body>'
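As far as I know, the flags keyword of re.sub was only added in Python 2.7, so on the 2.6.6 interpreter mentioned in the question the keyword form may not be available. Two alternatives that should work on both versions are sketched below; the variable names (`span_re`, `via_compile`, `via_inline`) are made up for illustration.

```python
import re

text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

# Option 1: compile the pattern first. flags is the second argument of
# re.compile, so there is no count parameter for re.S to collide with.
span_re = re.compile(r'<span[^>]*>(.*?)</span>', re.S)
via_compile = span_re.sub(r'\1', text1)

# Option 2: put the inline flag (?s) at the start of the pattern,
# which turns on DOTALL without passing any flags argument at all.
via_inline = re.sub(r'(?s)<span[^>]*>(.*?)</span>', r'\1', text1)

print(repr(via_compile))
print(repr(via_inline))
```

Both variants strip the <span> tag even though its content contains a newline, giving the same result as the flags=re.S call above.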
