Making a URL regex global

Date: 2012-09-10 13:21:21

Tags: php regex url

I have been looking for a regular expression that replaces plain-text URLs in a string (the string can contain more than one URL) with:

 <a href="url">url</a>

I found this: http://mathiasbynens.be/demo/url-regex

I would like to use diegoperini's regex (the best one according to the tests):

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

But I want it to replace all URLs in the string globally. When I use it like this:

/_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS/g

it doesn't work. How can I make this regex global, and what do the underscore at the start and the "_iuS" at the end mean?

I want to use it from PHP, so I am calling:

preg_replace($regex, '<a href="$0">$0</a>', $examplestring);

2 Answers:

Answer 0: (score: 0)

The underscores are the regex delimiters; i, u, and S are pattern modifiers:

  

i (PCRE_CASELESS)

If this modifier is set, letters in the pattern match both upper and lower 
case letters.
     

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible 
with Perl. Pattern and subject strings are treated as UTF-8. The 
\x{00a1}-\x{ffff} ranges in the pattern depend on it. It should not be confused 
with the uppercase U (PCRE_UNGREEDY), which inverts the greediness of quantifiers.
     

S

When a pattern is going to be used several times, it is worth spending more 
time analyzing it in order to speed up the time taken for matching. If this 
modifier is set, then this extra analysis is performed. At present, studying 
a pattern is useful only for non-anchored patterns that do not have a single 
fixed starting character.

For more information, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
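As a rough illustration of the i and u modifiers described above, here is a minimal sketch. The test strings are made up for this example, and the file is assumed to be saved as UTF-8.

```php
<?php
// i: letters in the pattern match both upper and lower case.
$caseSensitive   = preg_match('_^https?://_',  'HTTP://EXAMPLE.COM');
$caseInsensitive = preg_match('_^https?://_i', 'HTTP://EXAMPLE.COM');

// u: pattern and subject are treated as UTF-8, so the \x{00a1}-\x{ffff}
// range can match non-ASCII letters such as "ü" (U+00FC).
$unicodeHost = preg_match('_^[a-z\x{00a1}-\x{ffff}]+$_iu', 'münchen');

var_dump($caseSensitive, $caseInsensitive, $unicodeHost); // int(0), int(1), int(1)
```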

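When you added /.../g, you wrapped the pattern in a second pair of regex delimiters and added the modifier g, which does not exist in PCRE; that is why it fails. preg_replace() already replaces every match by default, so no global flag is needed.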
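The point about delimiters can be sketched as follows. The pattern here is a deliberately simplified stand-in (an assumption for brevity, not diegoperini's full regex), and the sample text is invented.

```php
<?php
$text = 'See http://example.com/a and https://example.org/b for details.';

// "_" and "#" are both valid delimiters; modifiers follow the closing delimiter.
$withUnderscore = preg_replace('_https?://\S+_i', '<a href="$0">$0</a>', $text);
$withHash       = preg_replace('#https?://\S+#i', '<a href="$0">$0</a>', $text);

// The optional fourth argument caps the number of replacements.
// Its default of -1 means "no limit", which is why no g flag is needed.
$firstOnly = preg_replace('_https?://\S+_i', '<a href="$0">$0</a>', $text, 1);

echo $withUnderscore, "\n";
```

Both delimiter choices should produce the same string with two links, while the limited call wraps only the first URL.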

Answer 1: (score: 0)

I agree with @verdesmarald and use this pattern in the following function:

$string = preg_replace_callback(
        // Single quotes keep PHP from interpreting the backslashes in the pattern.
        '_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS',
        // create_function() was removed in PHP 8.0; an anonymous function does the same job.
        function ($match) {
            // Build the link label: lowercase, without the scheme or "www.".
            $m = trim(strtolower($match[0]));
            $m = str_replace(array("http://", "https://", "ftp://", "www."), "", $m);

            // Shorten long labels to 25 characters plus an ellipsis.
            if (strlen($m) > 25)
            {
                $m = substr($m, 0, 25) . "...";
            }

            // The href keeps the URL exactly as matched.
            return "<a href=\"$match[0]\">$m</a>";
        }, $string);

    return $string;

It seems to do the trick and solves the problem I was having. As @verdesmarald said, removing the ^ and $ characters lets the pattern work in my preg_replace_callback() as well.
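To see why the anchors matter, here is a small sketch using a simplified stand-in pattern (an assumption for brevity, not the full regex) and invented sample text.

```php
<?php
$text = 'Docs: http://example.com/docs and ftp://example.org/file.txt here.';

// Anchored with ^ and $: the whole subject would have to be one URL,
// so nothing matches when the URLs are embedded in other text.
$anchored = preg_match_all('_^(?:https?|ftp)://\S+$_i', $text, $m1);

// Unanchored: every URL inside the text is found.
$unanchored = preg_match_all('_(?:https?|ftp)://\S+_i', $text, $m2);

print_r($m2[0]);
```

The anchored form suits validating a single URL; the unanchored form is what a search-and-replace over free text needs.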

My only concern is the efficiency of the pattern. Could it become a bottleneck in a busy, high-traffic web application?
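One way to approach that question is to measure it on representative input. The following is only a sketch: timeReplace() is a hypothetical helper written for this example, the pattern is a simplified stand-in, and the subject text is invented. Substituting the full pattern and real content would give more meaningful numbers.

```php
<?php
// Hypothetical helper: average seconds per preg_replace() call over $runs runs.
function timeReplace(string $pattern, string $subject, int $runs = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_replace($pattern, '<a href="$0">$0</a>', $subject);
    }
    return (microtime(true) - $start) / $runs;
}

// Simplified stand-in pattern; swap in the full pattern to measure it.
$pattern = '_https?://\S+_i';
$subject = str_repeat('Some text with http://example.com/page and more text. ', 50);

printf("%.6f s per call\n", timeReplace($pattern, $subject));
```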

Update

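The regex pattern above breaks when the path part of a URL ends with a trailing dot, as in http://www.mydomain.com/page.. To work around this, I modified the last part of the pattern by adding ^. so that the final part looks like [^\s^.]. As I read it: do not match a trailing whitespace character or dot. (Strictly speaking, only the first ^ in a character class negates; the second is a literal caret, and the class excludes dots anywhere in the path, not only at the end.)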

So far it seems to work fine in my tests.
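The difference between excluding every dot and excluding only a trailing dot can be sketched as follows. The host part of the pattern is a simplified stand-in (an assumption for brevity), and the lookbehind variant is a suggested alternative rather than part of the original answer.

```php
<?php
$text = 'Read http://www.mydomain.com/page.html. Then visit http://www.mydomain.com/page.';

// Path group from the update: [^\s^.] rejects every dot (and a literal ^),
// so the match stops at the first dot in the path.
preg_match_all('_https?://[^\s/]+(?:/[^\s^.]*)?_i', $text, $classVariant);

// Alternative: keep \S* and add a negative lookbehind so that the match
// cannot end in a dot; dots inside the path are still allowed.
preg_match_all('_https?://[^\s/]+(?:/\S*(?<!\.))?_i', $text, $lookbehindVariant);

print_r($classVariant[0]);      // both matches end at ".../page"
print_r($lookbehindVariant[0]); // ".../page.html" and ".../page"
```

With the character-class version the ".html" extension is cut off, whereas the lookbehind version only drops the sentence-ending dot.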