Question

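I'm trying to write a regular expression that will match URLs in a string of text that may be HTML-encoded. I'm having a lot of trouble with it, though. I need something that correctly matches both links in the string below: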

 some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" &quot;http://www.notarealwebsite.com&quot; some other text

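Spelled out, what I want is: "http://" followed by any number of characters that are not whitespace, a quotation mark, or the string "&quot;" (I don't care about accepting other non-URL-safe characters as delimiters).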

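I've tried several regexes that use lookahead to check for an & followed by q, then u, and so on, but as soon as I put that inside a [^...] negation it breaks down completely and is evaluated more like: "http:// followed by any number of characters that are not a space, a quote, an ampersand, q, u, o, t, or a semicolon", which is obviously not what I want.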

This correctly matches the & at the beginning of &quot;:

&(?=q(?=u(?=o(?=t(?=;)))))

But this doesn't work:

http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*
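To make the failure concrete, here is a minimal Python sketch run against a shortened version of the sample string (the variable names are only illustrative). Inside a character class the lookahead syntax has no special meaning, so `(`, `?`, `=`, `q`, `u`, `o`, `t`, `;` and `)` are simply added to the set of excluded characters:

```python
import re

# Inside [...] the lookahead syntax is not special: each of ( ? = q u o t ; )
# is treated as a literal character to exclude.
broken = re.compile(r'http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*')

text = 'some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" more'
match = broken.search(text)

# The match stops at the first "o" in "notarealwebsite".
print(match.group(0))  # http://www.n
```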

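I know just enough about regular expressions to get into trouble, and that includes not knowing why this doesn't behave the way I want. I understand positive and negative lookarounds to some extent, but I don't understand why it breaks down inside [^...]. Is it possible to do this with a regular expression? Or am I wasting my time trying to make it work?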

Answer 1

If your regex implementation supports it, use a positive lookahead and a backreference, with a non-greedy expression in the body.

Here is one that meets your conditions: (["\s]|&quot;)(http://.*?)(?=\1)

For example, in Python:

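import re

# Group 1 captures the opening delimiter (quote, whitespace, or &quot;);
# the lookahead (?=\1) requires the same delimiter right after the URL.
p = re.compile(r'(["\s]|&quot;)(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more &quot;{0}&quot; test greed&quot;'
data = formatstr.format(url)
for m in p.finditer(data):
    # Group 2 is the URL itself, without the delimiters.
    print("Found:", m.group(2))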

Produces:

Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2

Or in Java:

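import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumes this method lives in a JUnit test class; the import for @Test
// depends on the JUnit version in use.
@Test
public void testRegex() {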
    Pattern p = Pattern.compile("([\"\\s]|&quot;)(https?://.*?)(?=\\1)", 
        Pattern.CASE_INSENSITIVE);
    final String URL = "http://test.url/here.php?var1=val&var2=val2";
    final String INPUT = "some text " + URL + " more text + \"" + URL + 
            "\" more then &quot;" + URL + "&quot; testing greed &quot;";

    Matcher m = p.matcher(INPUT);
    while( m.find() ) {
        System.out.println("Found: " + m.group(2));
    }
}

This produces the same output.
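As an alternative to the backreference approach, the rule stated in the question ("any character that is not whitespace, a quote, or the start of &quot;") can also be written directly by placing a negative lookahead in front of each consumed character. This is a different technique from the one above, offered only as a rough sketch; the variable names are illustrative and the input is the sample string from the question:

```python
import re

# Consume one character at a time, but only when the text at that position
# does not begin "&quot;" and the character is neither whitespace nor a
# double quote.
pattern = re.compile(r'https?://(?:(?!&quot;)[^\s"])*', re.IGNORECASE)

text = ('some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" '
        '&quot;http://www.notarealwebsite.com&quot; some other text')

# With no capturing groups, findall returns the full matches.
urls = pattern.findall(text)
for url in urls:
    print("Found:", url)
```

Unlike the backreference version, this sketch does not require the URL to be closed by the same delimiter that opened it; it simply stops at the first whitespace, quote, or &quot; it meets.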
