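How to add rel="nofollow" to links with preg_replace()

Time: 2011-02-18 04:24:49

Tags: php regex preg-match

The function below is meant to apply the rel="nofollow" attribute to all external links but not to internal links, unless the path matches a predefined root URL, defined below as $my_folder.

So given the variables...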

$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';

and the content...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com">external</a>

the final result, after replacement, should be...

<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator" rel="nofollow">internal cloaked link</a>

<a href="http://cnn.com" rel="nofollow">external</a>

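Note that the first link is left unchanged because it is an internal link.

The link on the second line is also internal, but because it matches the $my_folder string, it gets nofollow as well.

The third link is the easiest: it does not match $blog_url, so it is clearly an external link.

However, with the script below, all of my links get nofollow. How can I fix the script to do what I want?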

function save_rseo_nofollow($content) {
$my_folder =  $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
    preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
    for ( $i = 0; $i <= sizeof($matches[0]); $i++){
        if ( !preg_match( '~nofollow~is',$matches[0][$i])
            && (preg_match('~' . $my_folder . '~', $matches[0][$i]) 
               || !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
            $result = trim($matches[0][$i],">");
            $result .= ' rel="nofollow">';
            $content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
        }
    }
    return $content;
}

9 Answers:

Answer 0 (score: 14)

Here is a DOMDocument solution...

$str = '<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me">external</a>

<a href="http://google.com">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel">external</a>
';
$dom = new DOMDocument();

$dom->preserveWhiteSpace = FALSE; // property names are case-sensitive

$dom->loadHTML($str);

$a = $dom->getElementsByTagName('a');

$host = strtok($_SERVER['HTTP_HOST'], ':');

foreach($a as $anchor) {
        $href = $anchor->attributes->getNamedItem('href')->nodeValue;

        if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
           continue;
        }

        $noFollowRel = 'nofollow';
        $oldRelAtt = $anchor->attributes->getNamedItem('rel');

        if ($oldRelAtt == NULL) {
            $newRel = $noFollowRel;
        } else {
            $oldRel = $oldRelAtt->nodeValue;
            $oldRel = explode(' ', $oldRel);
            if (in_array($noFollowRel, $oldRel)) {
                continue;
            }
            $oldRel[] = $noFollowRel;
            $newRel = implode(' ', $oldRel); // separator first; the reversed order was removed in PHP 8
        }

        $newRelAtt = $dom->createAttribute('rel');
        $noFollowNode = $dom->createTextNode($newRel);
        $newRelAtt->appendChild($noFollowNode);
        $anchor->appendChild($newRelAtt);

}

var_dump($dom->saveHTML());

Output

string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="http://localhost/mytest/">internal</a>

<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>

<a href="http://cnn.com" rel="me nofollow">external</a>

<a href="http://google.com" rel="nofollow">external</a>

<a href="http://example.com" rel="nofollow">external</a>

<a href="http://stackoverflow.com" rel="junk in the rel nofollow">external</a>
</body></html>
"

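Note that in the output above the internal cloaked link is left without nofollow, because the host check skips every link on the current host. Below is a minimal sketch of a check that also covers the $my_folder case from the question. It assumes absolute URLs, and needs_nofollow() is a hypothetical helper name, not part of the answer:

```php
<?php
// Hypothetical helper: decides whether an absolute URL should get rel="nofollow".
function needs_nofollow(string $href, string $blog_url, string $my_folder): bool
{
    // Links under the cloaking folder are always nofollowed.
    if ($my_folder !== '' && strncmp($href, $my_folder, strlen($my_folder)) === 0) {
        return true;
    }
    // Anything else that starts with the blog URL is a normal internal link.
    return strncmp($href, $blog_url, strlen($blog_url)) !== 0;
}

$my_folder = 'http://localhost/mytest/go/';
$blog_url  = 'http://localhost/mytest';

var_dump(needs_nofollow('http://localhost/mytest/', $blog_url, $my_folder));             // bool(false)
var_dump(needs_nofollow('http://localhost/mytest/go/hostgator', $blog_url, $my_folder)); // bool(true)
var_dump(needs_nofollow('http://cnn.com', $blog_url, $my_folder));                       // bool(true)
```

Inside the foreach loop, the host check could then be replaced with something like `if (!needs_nofollow($href, $blog_url, $my_folder)) { continue; }`.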
Answer 1 (score: 9)

First try to make it more readable, and only then make the if rules more complicated:

function save_rseo_nofollow($content) {
    $content["post_content"] =
    preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
    return $content;
}

function cb2($match) { 
    list($original, $tag) = $match;   // regex match groups

    $my_folder =  "/hostgator";       // re-add quirky config here
    $blog_url = "http://localhost/";

    if (strpos($tag, "nofollow")) {
        return $original;
    }
    elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
        return $original;
    }
    else {
        return "<$tag rel='nofollow'>";
    }
}

This gives the following output:

[post_content] =>
  <a href="http://localhost/mytest/">internal</a>
  <a href="http://localhost/mytest/go/hostgator" rel='nofollow'>internal cloaked link</a>
  <a href="http://cnn.com" rel='nofollow'>external</a>

The problem in the original code is probably $rseo, which is not declared anywhere.
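To illustrate why an undeclared $rseo could make every link match: if the lookup yields null, the folder pattern becomes empty, and an empty pattern matches any string. This is a minimal sketch, assuming $rseo is a global that is not visible inside the function:

```php
<?php
// Inside the function $rseo is undefined, so the array lookup yields null,
// which concatenates as an empty string.
$my_folder = null;

// The pattern becomes '~~', and an empty pattern matches any subject.
$tag = '<a href="http://localhost/mytest/">';
$matched = preg_match('~' . $my_folder . '~', $tag);
var_dump($matched); // int(1)

// With a real value, preg_quote() keeps regex metacharacters in the URL literal.
$my_folder = 'http://localhost/mytest/go/';
$matched_folder = preg_match('~' . preg_quote($my_folder, '~') . '~', $tag);
var_dump($matched_folder); // int(0)
```

If $rseo really is a global, adding `global $rseo;` at the top of the function would be one way to make the value available.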

Answer 2 (score: 7)

Try this (PHP 5.3+):

  • skips the selected address
  • lets you set the rel attribute manually

And the code:

function nofollow($html, $skip = null) {
    return preg_replace_callback(
        "#(<a[^>]+?)>#is", function ($mach) use ($skip) {
            return (
                !($skip && strpos($mach[1], $skip) !== false) &&
                strpos($mach[1], 'rel=') === false
            ) ? $mach[1] . ' rel="nofollow">' : $mach[0];
        },
        $html
    );
}

Examples:

echo nofollow('<a href="link somewhere" rel="something">something</a>');
// unchanged, because the anchor already has a rel attribute

echo nofollow('<a href="http://www.cnn.com">something</a>');
// adds the rel="nofollow" attribute to the anchor

echo nofollow('<a href="http://localhost">something</a>', 'localhost');
// skipped, because it is an internal link

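Answer 3 (score: 3)

Getting this right with regular expressions would be very complicated. It is easier to use a real parser, such as the one in the DOM extension. DOM is not very beginner-friendly, so you can load the HTML with DOM and then make the modifications with SimpleXML. Both are backed by the same library, so it is easy to use one with the other.

Here is what it looks like: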

$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';

$html = '<html><body>
<a href="http://localhost/mytest/">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>
</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$sxe = simplexml_import_dom($dom);

// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
    if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
     && substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
    {
        // skip all links that start with the URL in $blog_url, as long as they
        // don't start with the URL from $my_folder;
        continue;
    }

    if (empty($a['rel']))
    {
        $a['rel'] = 'nofollow';
    }
    else
    {
        $a['rel'] .= ' nofollow';
    }
}

$new_html = $dom->saveHTML();
echo $new_html;

As you can see, it is quite short. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:

    // change the regexp to your own rules, here we match everything under
    // "http://localhost/mytest/" as long as it's not followed by "go"
    if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
    {
        continue;
    }

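Note

When I first read the question, I missed the last code block in the OP. The code I posted (and basically any DOM-based solution) is better suited to processing a whole page than a block of HTML. Otherwise, DOM will try to "fix" your HTML and add a <body> tag, a DOCTYPE, and so on.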
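One way to get only the processed block back is to serialize the children of <body> rather than the whole document. This is a minimal sketch, assuming the input is a fragment without its own <html> or <body> tags; body_inner_html() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns only the markup inside <body>,
// dropping the DOCTYPE and the <html>/<body> wrappers that loadHTML() adds.
function body_inner_html(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $html = '';
    foreach ($body->childNodes as $child) {
        // Passing a node to saveHTML() serializes just that subtree.
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<a href="http://cnn.com">external</a><a href="http://localhost/mytest/">internal</a>');

$fragment = body_inner_html($dom);
echo $fragment;
```

The same helper could replace the final `$dom->saveHTML()` call when only a post body is being filtered.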

Answer 4 (score: 0)

<?php

$str='<a href="http://localhost/mytest/">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>';

function test($x){
  if (preg_match('@localhost/mytest/(?!go/)@i',$x[0])>0) return $x[0];
  return 'rel="nofollow" '.$x[0];
}

echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);

?>

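Answer 5 (score: 0)

Here is another solution that has a whitelist option and adds the target="_blank" attribute. It also checks whether a rel attribute already exists before adding a new one.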

function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true) 
{
    $Whitelist[] = $_SERVER['HTTP_HOST'];
    foreach ($Whitelist as $Key => $Link) 
    {
        $Host = preg_replace('#^https?://#', '', $Link);
        $Host = "https?://". preg_quote($Host, '#'); // quote for the '#' delimiter used below
        $Whitelist[$Key] = $Host;
    }

    if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER)) 
    {
        foreach ($matches as $Anchor_Tag) 
        {
            $IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag =  false;
            if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2)) 
            {
                foreach ($All_matches2[1] as $Key => $Attr_Name)
                {
                    if($Attr_Name == 'href')
                    {
                        $Is_Valid_Tag = true;
                        $Url = $All_matches2[2][$Key];
                        // bypass #.. or internal links like "/"
                        if(preg_match('/^\s*[#|\/].*/', $Url)) 
                        {
                            continue 2;
                        }

                        foreach ($Whitelist as $Link) 
                        {
                            if (preg_match("#$Link#", $Url)) {
                                continue 3;
                            }
                        }
                    }
                    else if($Attr_Name == 'rel')
                    {
                        $IS_Rel_Exist = true;
                        $Rel = $All_matches2[2][$Key];
                        preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
                        if( count($match) > 0 )
                        {
                            $IS_Follow_Exist = true;
                        }
                        else
                        {
                            $New_Rel = 'rel="'. $Rel . ' nofollow"';
                        }
                    }
                    else if($Attr_Name == 'target')
                    {
                        $IS_Target_Blank_Exist = true;
                    }
                }
            }

            $New_Anchor_Tag = $Anchor_Tag;
            if(!$IS_Rel_Exist)
            {
                $New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
            }
            else if(!$IS_Follow_Exist)
            {
                $New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
            }

            if($Add_Target_Blank && !$IS_Target_Blank_Exist)
            {
                $New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
            }

            $Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
        }
    }
    return $Content;
}

To use it:

$Page_Content = '<a href="http://localhost/">internal</a>
                 <a href="http://yoursite.com">internal</a>
                 <a href="http://google.com">google</a>
                 <a href="http://example.com" rel="nofollow">example</a>
                 <a href="http://stackoverflow.com" rel="random">stackoverflow</a>';

$Whitelist = ["http://yoursite.com","http://localhost"];

echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);

Answer 6 (score: 0)

Thanks to @alex for the excellent solution. However, I had a problem with Japanese text, and I fixed it as follows. This code can also skip multiple domains using the $whiteList array.
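The code for this answer is not preserved here. The following is only a sketch of one common approach, not the answer's original code: it assumes the input is UTF-8, that $whiteList holds bare host names, and add_nofollow_utf8() is a hypothetical function name.

```php
<?php
// Hypothetical helper, not the answer's original code.
function add_nofollow_utf8(string $html, array $whiteList): string
{
    $dom = new DOMDocument();
    // The leading XML declaration tells libxml to read the input as UTF-8;
    // without an encoding hint, multibyte text such as Japanese can be garbled.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_NOERROR | LIBXML_NOWARNING);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $host = parse_url($anchor->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($host) || in_array($host, $whiteList, true)) {
            continue; // relative link, unparsable URL, or whitelisted domain
        }
        // Keep existing rel values and append nofollow only when it is missing.
        $rel = preg_split('/\s+/', $anchor->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $rel[] = 'nofollow';
            $anchor->setAttribute('rel', implode(' ', $rel));
        }
    }

    // Return only the contents of <body>, without the wrappers loadHTML() adds.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$out = add_nofollow_utf8(
    '<a href="http://example.jp/">日本語</a><a href="http://mysite.test/">内部</a>',
    ['mysite.test']
);
echo $out;
```

Here mysite.test stands in for your own domain; any host listed in $whiteList is left untouched.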

Answer 7 (score: 0)

A WordPress solution:

function replace__method($match) {
    list($original, $tag) = $match;   // regex match groups

    $my_folder =  "/articles";       // re-add quirky config here
    $blog_url = 'https://'.$_SERVER['SERVER_NAME'];

    if (strpos($tag, "nofollow")) {
        return $original;
    }
    elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
        return $original;
    }
    else {
        return "<$tag rel='nofollow'>";
    }
}

add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );

function add_nofollow_to_external_links( $content ) {
    $content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
    return $content;
}

Answer 8 (score: -1)

A nice script that adds nofollow automatically while preserving the other attributes:

function nofollow(string $html, ?string $baseUrl = null) {
    return preg_replace_callback(
            '#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
                list ($a, $attr, $text) = $mach;
                if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
                    $url = $url[1];
                    if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
                        $rel = '';
                        $relAttr = null;
                        if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $m)) {
                            $relAttr = $m[0];
                            $rel = $m[1];
                        }
                        // strpos() returns 0 for a match at the start, so compare with false
                        if (strpos($rel, 'nofollow') === false) {
                            $rel = trim($rel . ' nofollow');
                        }
                        $rel = 'rel="' . $rel . '"';
                        $attr = $relAttr !== null ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
                        $a = '<a' . $attr . '>' . $text . '</a>';
                    }
                }
                return $a;
            },
            $html
    );
}