Ignoring specific tags in a regular expression - negative lookahead

Time: 2017-11-05 06:30:06

Tags: php regex regex-negation regex-lookarounds regex-group

I have this scenario in my PHP code, where I have the following string:

This is an outside Example <p href="https://example.com"> This is a para Example</p><markup class="m"> this is a markup example</markup>

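I want to do a case-insensitive search for the word example in this string, but:

  • I want my regex to ignore example when it occurs inside a tag's attributes (I was able to achieve this)
  • I want the search to completely skip anything inside <markup ..> any content </markup>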

What I have so far is:

/(example)(?![^<]*>)/i

This works fine and ignores example inside the href attribute of the p tag. I have now modified it for <markup>:

/(example)(?!([^<]*>)|(\<markup[^>]*>[^<]*<\/markup\>))/i

But this does not work. You can see my attempt at https://regex101.com/r/e2XujN/1

  

What I want to achieve with this:

I will replace each matched example as follows:

  • If I find eXamPle, it will be replaced by <markup>eXamPle</markup>
  • Example will be replaced by <markup>Example</markup>

and so on.

  

Note: the replacement keeps the same letter case as the text that was matched in the string.

3 Answers:

Answer 0: (score: 1)

You can use the (*SKIP)(*F) verbs in PCRE to match and then skip certain substrings enclosed by a pattern/string (here, markup), as follows:

(markup).*\1(*SKIP)(*F)|(example)(?![^<]*>)

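Explanation:

Excluded substring:
 (markup): first capturing group; matches the characters markup literally (case-insensitive)
 .*: matches any character (except line terminators)
 \1: matches the same text as the first capturing group
 (*SKIP)(*F): shorthand for (*SKIP)(*FAIL); does not match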
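
As a minimal sketch of how this pattern could be used from PHP, the snippet below runs it through preg_replace on the string from the question. The variable names ($skipSubject, $skipPattern, $skipResult) and the choice of ~ as delimiter are assumptions made for illustration. Because the word is captured by the second group in this pattern, the replacement uses $0 (the whole match), which also keeps the original letter case.

```php
<?php
// Sample string taken from the question.
$skipSubject = 'This is an outside Example <p href="https://example.com"> This is a para Example</p>'
    . '<markup class="m"> this is a markup example</markup>';

// Alternative 1 consumes "markup ... markup" and then throws it away with (*SKIP)(*F).
// Alternative 2 matches "example" unless a ">" follows before the next "<".
$skipPattern = '~(markup).*\1(*SKIP)(*F)|(example)(?![^<]*>)~i';

// $0 is the full match, so eXamPle stays eXamPle inside the new tag.
$skipResult = preg_replace($skipPattern, '<markup>$0</markup>', $skipSubject);

echo $skipResult, PHP_EOL;
```

With the sample string this should wrap the two occurrences of Example that sit outside <markup> and leave both the href value and the existing <markup> content alone. Two things to keep in mind: .* is greedy and does not cross line breaks, so everything between the first and the last "markup" on a line is skipped, and a <markup> element that spans several lines would need the s modifier.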

Answer 1: (score: 1)

You can solve this the same way you solved the first problem: check that the match is not followed by the closing tag.

Regex:

(example)(?![^<]*>)(?![^<]*<\/markup\>)

Demo
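
A short PHP sketch of this approach, using the string from the question; the variable names ($lookSubject, $lookPattern, $lookResult) are made up for the example. The first lookahead rejects matches inside a tag (attributes), the second rejects matches that are followed by </markup> before any other "<".

```php
<?php
// Sample string taken from the question.
$lookSubject = 'This is an outside Example <p href="https://example.com"> This is a para Example</p>'
    . '<markup class="m"> this is a markup example</markup>';

// Two negative lookaheads in a row: both must succeed for "example" to match.
$lookPattern = '/(example)(?![^<]*>)(?![^<]*<\/markup\>)/i';

// $1 holds the word exactly as it was written, so the case is preserved.
$lookResult = preg_replace($lookPattern, '<markup>$1</markup>', $lookSubject);

echo $lookResult, PHP_EOL;
```

Note that the second lookahead relies on [^<]*, so it only works when the <markup> element contains no nested tags between the word and </markup>. If another tag sits in between, the word would be matched and wrapped again.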

Answer 2: (score: 0)

The answer is to use the DOM, but working with text nodes and inserting HTML content into them is a bit tricky.

PHP live demo

$content = <<< 'HTML'
This is an outside Example <p href="https://example.com"> This is a para Example</p>
test <markup class="m"> this is a markup example</markup> another example <p>example</p>
HTML;

// Initialize a DOM object
$dom = new DOMDocument();
// Use an HTML element tag as our HTML container
// @hakre [https://stackoverflow.com/a/29499718/1020526]
@$dom->loadHTML("<div>$content</div>");

$wrapper = $dom->getElementsByTagName('div')->item(0);
// Remove wrapper
$wrapper = $wrapper->parentNode->removeChild($wrapper);
// Remove all nodes of $dom object
while ($dom->firstChild) {
    $dom->removeChild($dom->firstChild);
}
// Append all $wrapper object nodes to $dom
while ($wrapper->firstChild) {
    $dom->appendChild($wrapper->firstChild);
}

$dox = new DOMXPath($dom);
// Query all elements in addition to text nodes
$query = $dox->query('/* | /text()');

// Iterate through all nodes
foreach ($query as $node) {
    // If it's not an HTML element
    if ($node->nodeType != XML_ELEMENT_NODE) {
        // Replace desired word / content
        $newContent = preg_replace('~(example)~i',
            '<markup>$1</markup>',
            $node->wholeText);
        // We can't insert HTML directly into our node
        // so we need to create a document fragment
        $newNode = $dom->createDocumentFragment();
        $newNode->appendXML($newContent);
        // Replace the old text node with the new content
        $node->parentNode->replaceChild($newNode, $node);
    }
}

// Save modifications
echo $dom->saveHTML($dom);
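
The query '/* | /text()' above appears to visit only top-level nodes, so text nested inside an element such as <p> would not be processed. The following is a sketch of a variation, not the answer author's code: it selects every text node that has no <markup> ancestor and builds the replacement from DOM nodes rather than through appendXML. The function name wrapWordOutsideMarkup and its parameters are hypothetical, and it assumes the PHP dom extension is available.

```php
<?php
// Hypothetical helper: wraps each case-insensitive occurrence of $word in
// <markup>, leaving attribute values and existing <markup> content untouched.
function wrapWordOutsideMarkup(string $html, string $word): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings (e.g. about the non-standard <markup> tag) silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Only text nodes are selected, so attributes are never touched;
    // the predicate drops text that sits inside a <markup> element.
    $textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::markup)]'));

    $pattern = '~(' . preg_quote($word, '~') . ')~i';
    foreach ($textNodes as $textNode) {
        // Odd indexes hold the matched word, even indexes the text around it.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // the word does not occur in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $wrapped = $dom->createElement('markup');
                $wrapped->appendChild($dom->createTextNode($part));
                $fragment->appendChild($wrapped);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper <div> added above.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $output = '';
    foreach ($wrapper->childNodes as $child) {
        $output .= $dom->saveHTML($child);
    }
    return $output;
}

$domResult = wrapWordOutsideMarkup(
    'This is an outside Example <p href="https://example.com"> This is a para Example</p>'
    . '<markup class="m"> this is a markup example</markup>',
    'example'
);
echo $domResult, PHP_EOL;
```

Building the fragment from text and element nodes avoids having to produce well-formed XML by hand, which matters when the text contains characters such as & or <. The exact whitespace and entity formatting of the output depends on how libxml serializes the nodes.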