Remove a div from content using a regular expression

Posted: 2016-04-29 23:26:44

Tags: php regex preg-replace

I'm trying to remove one specific div (together with its inner content) from a block of content, but it isn't working properly.

The regex:

/<div class="greybackground_desktop".*>(.*)<\/div>/s

The preg_replace call:

preg_replace($pattern, "", $holder, -1, $count );

The regex does strip my div, but if any other closing div tags appear later in the content, it strips those as well, along with everything in between.

For example:

<p>some random text</p>

<div class="greybackground_desktop" style="background-color:#EFEFEF;">
<!-- /49527960/CSF_Article_Middle -->
<div style="padding-bottom:10px; padding-top: 10px; text-align:center;" id='div-gpt-ad-1441883689230-0'>
<script type='text/javascript'>
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1441883689230-0'); });
</script>
</div>
</div>

<p>some more text</p>

<div><p>example of content that will be incorrectly removed</p></div>

<p>Text that follows</p>

This produces the following output:

some random text

Text that follows

What I want to see is:

some random text

some more text

example of content that will be incorrectly removed

Text that follows

Any ideas?

2 answers:

Answer 0: (score: 3)

Use a parser such as DOMDocument. Consider the following code:

<?php
$dom = new DOMDocument();
$dom->loadHTML($your_html_here);

$xpath = new DOMXpath($dom);

foreach ($xpath->query("//div[@class='greybackground_desktop']") as $div)
    $div->parentNode->removeChild($div);

echo $dom->saveHTML();
?>

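The script loads your HTML, finds every div element whose class attribute is exactly greybackground_desktop, and removes those elements together with everything nested inside them. A demo can be found on ideone.com.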
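The XPath test @class='greybackground_desktop' only matches when that is the whole class attribute. A minimal sketch of a more tolerant variant follows, assuming the div may carry additional classes and that the input is a non-empty HTML fragment rather than a full page. The function name removeDivsByClass is hypothetical, not something from the answer, and $class is assumed to be a plain class token without quotes.

```php
<?php
// Hypothetical helper (not from the original answer): removes every <div>
// whose class list contains $class and returns the remaining fragment.
function removeDivsByClass(string $html, string $class): string
{
    if (trim($html) === '') {
        return $html; // loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    // Collect parser warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Note: without an explicit charset, loadHTML() may misread non-ASCII text.
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole token, even when other classes are present.
    $xpath = new DOMXPath($dom);
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    foreach ($xpath->query($query) as $div) {
        $div->parentNode->removeChild($div);
    }

    // loadHTML() wraps fragments in <html><body>; return only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$sample = '<p>some random text</p>'
    . '<div class="ad greybackground_desktop"><div>ad slot</div></div>'
    . '<div><p>example of content that must survive</p></div>';
$cleaned = removeDivsByClass($sample, 'greybackground_desktop');
echo $cleaned, "\n";
```

Returning only the children of body keeps the result usable as a fragment, since saveHTML() on the whole document would also emit the doctype and the html/body wrappers that the parser adds.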

Answer 1: (score: 1)

The right way is to use an HTML parser such as DOMDocument. Here is an example:

$holder = <<< LOL
<p>some random text</p>
<div class="greybackground_desktop" style="background-color:#EFEFEF;">
<!-- /49527960/CSF_Article_Middle -->
<div style="padding-bottom:10px; padding-top: 10px; text-align:center;" id='div-gpt-ad-1441883689230-0'>
<script type='text/javascript'>
googletag.cmd.push(function() { googletag.display('div-gpt-ad-1441883689230-0'); });
</script>
</div>
</div>
<p>some more text</p>
<div><p>example of content that will be incorrectly removed</p></div>
<p>Text that follows</p>
LOL;
$dom = new DOMDocument();
// ask the parser not to keep redundant whitespace
$dom->preserveWhiteSpace = false;
// parse the HTML into a DOM tree
$dom->loadHTML($holder);
// take the first <div> in the document; in this sample that is the greybackground_desktop div
if ($div = $dom->getElementsByTagName('div')->item(0)) {
    // remove the node by telling its parent node to remove the child
    $div->parentNode->removeChild($div);
    // output the modified document
    echo $dom->saveHTML();
}

Ideone DOMDocument Demo
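The example above relies on the target being the first div in the document. A rough sketch of a variant that checks the class and handles several matches is shown here. The function name removeDivsWithClass is an assumption made for illustration. Because getElementsByTagName() returns a live list that changes as nodes are removed, the sketch copies the list into an array before removing anything.

```php
<?php
// Hypothetical helper: removes every <div> whose class attribute equals $class
// and returns how many were removed.
function removeDivsWithClass(DOMDocument $dom, string $class): int
{
    $removed = 0;
    // Snapshot the live node list so removals cannot cause nodes to be skipped.
    $divs = iterator_to_array($dom->getElementsByTagName('div'));
    foreach ($divs as $div) {
        if ($div->getAttribute('class') === $class && $div->parentNode !== null) {
            $div->parentNode->removeChild($div);
            $removed++;
        }
    }
    return $removed;
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<p>some random text</p>'
    . '<div class="greybackground_desktop"><div>ad slot</div></div>'
    . '<div><p>example of content that must survive</p></div>'
);
$count = removeDivsWithClass($dom, 'greybackground_desktop');
$result = $dom->saveHTML();
echo $count, "\n";
echo $result;
```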

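If you really want to use a regex, use the lazy .*? instead of the greedy .* so the match stops at the earliest possible point. The trailing </div>\s+</div> in the pattern below makes the match end at the two consecutive closing tags of the nested pair, i.e.: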

$result = preg_replace('%<div class="greybackground_desktop".*?</div>\s+</div>%si', '', $holder);

Ideone Demo

Read more about regex repetition, in particular "Laziness Instead of Greediness":

http://www.regular-expressions.info/repeat.html
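To make the greedy-versus-lazy difference concrete, here is a small sketch using made-up sample strings (the variable names and the class name a are illustrative only). It also shows why a lazy match on its own is not enough for nested divs, which is the reason the pattern above anchors on two closing tags.

```php
<?php
$html = '<div class="a">first</div><p>kept</p><div>second</div>';

// Greedy: .* runs to the end of the string, then backtracks to the LAST </div>.
$greedy = preg_replace('%<div class="a".*</div>%s', '', $html);

// Lazy: .*? expands one character at a time and stops at the FIRST </div>.
$lazy = preg_replace('%<div class="a".*?</div>%s', '', $html);

// With a nested div, the lazy match ends at the inner </div>,
// leaving the outer closing tag behind.
$nested = '<div class="a"><div>inner</div></div><p>kept</p>';
$lazyNested = preg_replace('%<div class="a".*?</div>%s', '', $nested);

var_dump($greedy);     // string(0) ""
var_dump($lazy);       // string(32) "<p>kept</p><div>second</div>"
var_dump($lazyNested); // string(17) "</div><p>kept</p>"
```

Regular expressions cannot track arbitrary nesting depth in general, so the parser-based approach stays the more reliable choice whenever the markup may change.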