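Regex to ignore matches between <script> tags

Asked: 2012-09-21 14:40:57

Tags: php html regex html-parsing

I apologize, as I know very little about regex, and I don't even know exactly what this regex does (I did not write it - source), other than that it searches for a term so that the term can be highlighted.

Here is the regex: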

/(\b$term|$term\b)(?!([^<]+)?>)/iu

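The problem is that I need to make sure it does not match anything between <script></script> tags. I know there are many ways a script tag can be written, but all I really need is for it to ignore any text between <script and /script>, taking into account possible whitespace between script and the < or >, like so: < script /script >

Would anyone be able to modify it this way? I will notify the author of the plugin this regex comes from, so that the change can be included in future versions.

Edit: here is where it comes from: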

function relevanssi_highlight_terms($excerpt, $query) {
    $type = get_option("relevanssi_highlight");
    if ("none" == $type) {
        return $excerpt;
    }

    switch ($type) {
        case "mark":                        // thanks to Jeff Byrnes
            $start_emp = "<mark>";
            $end_emp = "</mark>";
            break;
        case "strong":
            $start_emp = "<strong>";
            $end_emp = "</strong>";
            break;
        case "em":
            $start_emp = "<em>";
            $end_emp = "</em>";
            break;
        case "col":
            $col = get_option("relevanssi_txt_col");
            if (!$col) $col = "#ff0000";
            $start_emp = "<span style='color: $col'>";
            $end_emp = "</span>";
            break;
        case "bgcol":
            $col = get_option("relevanssi_bg_col");
            if (!$col) $col = "#ff0000";
            $start_emp = "<span style='background-color: $col'>";
            $end_emp = "</span>";
            break;
        case "css":
            $css = get_option("relevanssi_css");
            if (!$css) $css = "color: #ff0000";
            $start_emp = "<span style='$css'>";
            $end_emp = "</span>";
            break;
        case "class":
            $css = get_option("relevanssi_class");
            if (!$css) $css = "relevanssi-query-term";
            $start_emp = "<span class='$css'>";
            $end_emp = "</span>";
            break;
        default:
            return $excerpt;
    }

    $start_emp_token = "*[/";
    $end_emp_token = "\]*";

    if ( function_exists('mb_internal_encoding') )
        mb_internal_encoding("UTF-8");

    $terms = array_keys(relevanssi_tokenize($query, $remove_stopwords = true));

    $phrases = relevanssi_extract_phrases(stripslashes($query));

    $non_phrase_terms = array();
    foreach ($phrases as $phrase) {
        $phrase_terms = array_keys(relevanssi_tokenize($phrase, false));
        foreach ($terms as $term) {
            if (!in_array($term, $phrase_terms)) {
                $non_phrase_terms[] = $term;
            }
        }
        $terms = $non_phrase_terms;
        $terms[] = $phrase;
    }

    usort($terms, 'relevanssi_strlen_sort');

    $word_boundaries = (get_option('relevanssi_word_boundaries', 'on') == 'on');
    foreach ($terms as $term) {
        $pr_term = preg_quote($term, '/');
        if ($word_boundaries) {
            $excerpt = preg_replace("/(\b$pr_term|$pr_term\b)(?!([^<]+)?>)/iu", $start_emp_token . '\\1' . $end_emp_token, $excerpt);
        }
        else {
            $excerpt = preg_replace("/($pr_term)(?!([^<]+)?>)/iu", $start_emp_token . '\\1' . $end_emp_token, $excerpt);
        }
        // thanks to http://pureform.wordpress.com/2008/01/04/matching-a-word-characters-outside-of-html-tags/
    }

    $excerpt = relevanssi_remove_nested_highlights($excerpt, $start_emp_token, $end_emp_token);

    $excerpt = str_replace($start_emp_token, $start_emp, $excerpt);
    $excerpt = str_replace($end_emp_token, $end_emp, $excerpt);
    $excerpt = str_replace($end_emp . $start_emp, "", $excerpt);
    if (function_exists('mb_ereg_replace')) {
        $pattern = $end_emp . '\s*' . $start_emp;
        $excerpt = mb_ereg_replace($pattern, " ", $excerpt);
    }

    return $excerpt;
}

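4 Answers:

Answer 0 (score: 2)

The most accurate way to do this is to:

  • Parse the HTML with a proper HTML parser
  • Ignore strings that are inside <script> tags.

You do not want to try to parse HTML with regular expressions. Here is an explanation of why: http://htmlparsing.com/regexes.html

It will make you sad in the long run. Take a look at the rest of http://htmlparsing.com/ for some pointers that can help you get started.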
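As a rough illustration of the parser-based approach, here is a minimal sketch using PHP's built-in DOM extension. The function name `highlight_text_nodes` and the choice of <mark> as the wrapper are assumptions made for this example; they are not part of Relevanssi. The idea is to visit only text nodes that have no <script> or <style> ancestor and wrap the matches there.

```php
<?php
// Hypothetical helper (not part of Relevanssi): wraps $term in <mark>
// inside text nodes only, skipping anything within <script> or <style>.
function highlight_text_nodes(string $html, string $term): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
    // The meta tag is a common way to hint that the input is UTF-8.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $html . '</body></html>'
    );
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//body//text()[not(ancestor::script) and not(ancestor::style)]');
    $pattern = '/(' . preg_quote($term, '/') . ')/iu';

    foreach (iterator_to_array($nodes) as $node) {
        // With one capture group, odd-numbered parts are the matched terms.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                   // no match in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $mark = $doc->createElement('mark');
                $mark->appendChild($doc->createTextNode($part));
                $frag->appendChild($mark);
            } else {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($frag, $node);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlight_text_nodes('bob says hi <script>var bob = 1;</script>', 'bob');
```

Because the term is only ever searched for inside text nodes, tag names, attributes and script bodies are never touched, which is what the regex lookahead is trying to approximate.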

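Answer 1 (score: 1)

Since lookbehind assertions must be fixed-length, you cannot use them to look for a preceding <script> tag somewhere before the term you are searching for.

So, after replacing all occurrences of the desired term, you need a second pass that reverts the modified occurrences that turn out to be inside <script> tags.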

# provide some sample data
$excerpt = 'My name is bob!

And bob is cool.

<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);

var bob = 5;
</script>

Yeah, the word "bob" works fine.';

$start_emp_token = '<em>';
$end_emp_token = '</em>';
$pr_term = 'bob';

# replace everything (not in a tag)
$excerpt = preg_replace("/(\b$pr_term|$pr_term\b)(?!([^<]+)?>)/iu", $start_emp_token . '$1' . $end_emp_token, $excerpt);

# undo some of the replacements
# a closure is used here because create_function() was removed in PHP 8
$excerpt = preg_replace_callback('#(<script(?:[^>]*)>)(.*?)(</script>)#is',
                       function ($matches) use ($start_emp_token, $end_emp_token, $pr_term) {
                           # strip the highlight tokens again, but only inside the script body
                           return $matches[1]
                                . str_replace($start_emp_token . $pr_term . $end_emp_token, $pr_term, $matches[2])
                                . $matches[3];
                       },
                       $excerpt);

var_dump($excerpt);

The code above produces the following output:

string(271) "My name is <em>bob</em>!

And <em>bob</em> is cool.

<script type="text/javascript">
var bobby = "It works fine even if you already have tagged the term <em>bob</em> inside the script tag.";
alert(bobby);

var bob = 5;
</script>

Yeah, the word "<em>bob</em>" works fine."

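Answer 2 (score: 0)

You mentioned in the comments that removing the script tags before performing the search would be acceptable.

$data = preg_replace('/<\s*script.*?\/script\s*>/ius', '', $data);

This code may help with that.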
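A small demonstration of this approach on made-up sample data (the `$data` string below is invented for illustration). Note the `s` modifier, which lets `.` match the newlines inside a multi-line script block, and the `\s*` parts, which tolerate the extra spaces described in the question.

```php
<?php
// Invented sample input: a script block with odd spacing and several lines.
$data = "Hello bob\n< script type=\"text/javascript\">\nvar bob = 5;\n</script >\nBye bob";

// i: case-insensitive, u: UTF-8, s: "." also matches newlines.
$clean = preg_replace('/<\s*script.*?\/script\s*>/ius', '', $data);

echo $clean;
```

The script block is removed as a whole, so a later highlighting pass over `$clean` never sees the `bob` inside it. This only suits cases where the script does not need to survive in the output.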

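Answer 3 (score: 0)

George, resurrecting this ancient question because it had a simple solution that wasn't mentioned. This situation is straight out of my current pet question, Match (or replace) a pattern except in situations s1, s2, s3 etc

You want to modify the following regex to exclude anything between <script> and </script>: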

(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)

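Please forgive me for replacing $term with SOMETERM; it is for clarity, because $ has a special meaning in regex.

With all the usual disclaimers about matching HTML with regex, to exclude anything between <script> and </script> you can simply add this to the beginning of your regex: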

<script>.*?</script>(*SKIP)(*F)|

So the regex becomes:

<script>.*?</script>(*SKIP)(*F)|(\bSOMETERM|SOMETERM\b)(?!([^<]+)?>)

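How does this work?

The left side of the OR (i.e., |) matches complete <script...</script> tags, then deliberately fails, and (*SKIP) makes the engine resume after the tag instead of retrying inside it. The right side matches what it matched before, and we know it is the right stuff because anything between script tags has already been skipped by the left side.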
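Here is a minimal sketch of how this could be wired into PHP. The function name `highlight_skip_scripts` and the <em> wrapper are assumptions for the example, not Relevanssi code. The left side is also loosened slightly beyond the bare `<script>` shown above: it allows attributes and optional whitespace, and the `s` modifier lets `.` span newlines.

```php
<?php
// Hypothetical helper (not part of Relevanssi): highlights $term with <em>,
// except inside <script>...</script> blocks.
function highlight_skip_scripts(string $excerpt, string $term): string
{
    $pr_term = preg_quote($term, '~');

    // Left of "|": consume a whole script block, then (*SKIP)(*F) discards it
    // and resumes matching after the block.
    // Right of "|": the original term pattern with its "not inside a tag" lookahead.
    $pattern = '~<\s*script\b[^>]*>.*?<\s*/\s*script\s*>(*SKIP)(*F)'
             . '|(\b' . $pr_term . '|' . $pr_term . '\b)(?!([^<]+)?>)~isu';

    return preg_replace($pattern, '<em>$1</em>', $excerpt);
}

$html = "bob is cool.\n<script type=\"text/javascript\">\nvar bob = 5;\n</script>\nHi bob";
echo highlight_skip_scripts($html, 'bob');
```

With this sample input, the two occurrences of bob outside the script block are wrapped in <em>, while `var bob = 5;` is left untouched.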

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...