Question

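I have been trying to write a regex that matches "anything" except a certain token. I am following this answer (Match everything except for specified strings), but it does not work for me at all...

Here is an example: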

text = '<a> whatever href="obviously_a_must_have" whatever <div> this div should be accepted </div> ... </a>'

regex = r'<a[^><]*href=\"[^\"]+\"(?!.*(</a>))*</a>' #(not working as intended)

[^><]* #- should accept any number of characters except < and >, meaning it shouldn't close the tag nor open a new one - *working*;
href=\"[^\"]+\" #- should match an href - *working*;
(?!.*(</a>))* #- should match anything but the end of the tag a - *NOT WORKING*.

Answer 1

The problem is in

(?!.*(</a>))*

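You have two errors.

/ should be escaped. Use \/ instead. (This matters where / is the pattern delimiter, as in regex101's default flavor; Python's re module accepts it either way.)
You cannot apply a * on top of another *. Try it on regex101 and it will say: * The preceding token is not quantifiable. I strongly recommend that site for testing and understanding regexes.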
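On the first point, here is a small check, assuming the pattern is run with Python's re module as in the question. It suggests that the escaped and unescaped forms of the closing tag behave the same there, so the escape is harmless but optional in Python. The variable names are mine.

```python
import re

closing = '</a>'

# Escaped slash, as recommended for flavors where / delimits the pattern
escaped = re.compile(r'<\/a>')
# Unescaped slash, as written in the question
unescaped = re.compile(r'</a>')

print(escaped.fullmatch(closing) is not None)    # True
print(unescaped.fullmatch(closing) is not None)  # True
```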

Your first part does not work either: your text has a > right after <a, so the rest of the regex will not match.

Let's start with:
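A minimal sketch of that failure, using the sample text from the question. Only the opening part of the pattern is tested; the names opening and fixed_opening are mine, and fixed_opening simply consumes the > explicitly.

```python
import re

text = '<a> whatever href="obviously_a_must_have" whatever <div> this div should be accepted </div> ... </a>'

# Opening part from the question: "<a", then anything except < and >, then href=
opening = re.compile(r'<a[^><]*href=')
# Same prefix, but with the ">" that follows "<a" in the text consumed explicitly
fixed_opening = re.compile(r'<a>[^><]*href=')

print(opening.search(text))                 # None: [^><]* cannot step over the ">"
print(fixed_opening.search(text).group(0))  # <a> whatever href=
```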

<a>[^><]*href=\"[^\"]+\".*(?:<\/a>)
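A quick way to try this first attempt in Python, assuming the sample text from the question. The second input, extra, is a hypothetical string of mine with one more closing tag appended; it shows how far the greedy .* runs.

```python
import re

text = '<a> whatever href="obviously_a_must_have" whatever <div> this div should be accepted </div> ... </a>'
extra = text + ' more </a>'  # hypothetical input with an extra closing tag

first_try = re.compile(r'<a>[^><]*href=\"[^\"]+\".*(?:<\/a>)')

# The sample text is matched in full
print(first_try.search(text).group(0) == text)    # True
# The greedy .* runs past the first </a> and stops at the last one
print(first_try.search(extra).group(0) == extra)  # True
```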

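This regex is much better and it will match your text. But it is not complete yet, because it will also match text that contains extra closing tags. We do not want such extra closing tags anywhere before the real end. So let's add a negative lookbehind: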

<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)

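But as you can see here, it only matches up to the first closing tag and ignores the rest. We want to reject such input instead. Also, we do not need any extra opening tags. Let's restrict the match with start and end anchors: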

^<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)$

Here is the test.
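The same comparison can be sketched in Python. The inputs extra and doubled are hypothetical strings of mine, and lookahead_variant is not from the answer: it is an assumption showing the more common negative-lookahead form of the same idea, included only for contrast.

```python
import re

text = '<a> whatever href="obviously_a_must_have" whatever <div> this div should be accepted </div> ... </a>'
extra = text + ' more </a>'  # a second closing tag after the real one
doubled = text + '</a>'      # two closing tags back to back at the end

# Patterns from the answer: without anchors, then with ^ and $
unanchored = re.compile(r'<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)')
anchored = re.compile(r'^<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)$')

# Not from the answer (assumption): each body character must not start </a>
lookahead_variant = re.compile(r'^<a>[^><]*href=\"[^\"]+\"(?:(?!<\/a>).)*<\/a>$')

print(unanchored.search(extra).group(0) == text)  # True: stops at the first </a>
print(anchored.search(text) is not None)          # True
print(anchored.search(extra))                     # None: extra closing tag rejected
print(anchored.search(doubled) is not None)       # True: back-to-back tags slip through
print(lookahead_variant.search(doubled))          # None
```

The lookbehind only blocks the character that comes right after a closing tag, which is why two adjacent closing tags at the very end still pass. If that case matters, the lookahead form may be the safer choice.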

Or perhaps you would rather keep the href inside <a...>? Something like:

'<a whatever href="obviously_a_must_have"> whatever <div> this div should be accepted </div> ... </a>'

Then the regex would be:

^<a[^><]*href=\"[^\"]+\"[^><]*>(?:(?<!<\/a>).)*(?:<\/a>)$

The test is here.
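A short Python check of this final pattern, as a sketch. The variable names are mine, and no_href is a hypothetical input added to show a tag without the required attribute.

```python
import re

inside = '<a whatever href="obviously_a_must_have"> whatever <div> this div should be accepted </div> ... </a>'
outside = '<a> whatever href="obviously_a_must_have" whatever <div> this div should be accepted </div> ... </a>'
no_href = '<a class="x"> hi </a>'  # hypothetical input without an href

final = re.compile(r'^<a[^><]*href=\"[^\"]+\"[^><]*>(?:(?<!<\/a>).)*(?:<\/a>)$')

print(final.search(inside) is not None)  # True: href sits inside the opening tag
print(final.search(outside))             # None: href comes after the tag has closed
print(final.search(no_href))             # None: no href at all
```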

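While developing a regex, I recommend starting with something simple in which several .* match everything, then replacing them step by step with the real parts.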
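One way that workflow might look in Python, as a rough sketch. The intermediate patterns in steps are my own guesses at plausible stages, not the author's actual drafts; only the last one is the regex given above.

```python
import re

sample = '<a whatever href="obviously_a_must_have"> whatever <div> this div should be accepted </div> ... </a>'

# Each step replaces one loose .* with a stricter part (stages 1-3 are assumptions)
steps = [
    r'^<a.*<\/a>$',                                                # 1: everything is .*
    r'^<a[^><]*>.*<\/a>$',                                         # 2: pin down the opening tag
    r'^<a[^><]*href=\"[^\"]+\"[^><]*>.*<\/a>$',                    # 3: require an href inside it
    r'^<a[^><]*href=\"[^\"]+\"[^><]*>(?:(?<!<\/a>).)*(?:<\/a>)$',  # 4: restrict the body
]

# Re-run the sample after every change so a regression is spotted immediately
results = [re.search(pattern, sample) is not None for pattern in steps]
for number, matched in enumerate(results, start=1):
    print(number, matched)
```

Checking the same sample after every tightening step makes it easy to see which change broke the match.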
