Question

I am using DOMXPath to remove HTML tags that have empty text nodes, while keeping <br/> tags:

$xpath = new DOMXPath($dom);

while(($nodeList = $xpath->query('//*[not(text()) and not(node()) and not(self::br)]')) && $nodeList->length > 0) 
{
    foreach ($nodeList as $node) 
    {
        $node->parentNode->removeChild($node);
    }
}

It worked perfectly until I ran into another problem:

$content = '<p><br/><br/><br/><br/></p>';

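How can I remove this kind of messy <br/> and <p>? In other words, I don't want to allow a paragraph that contains only <br/>, but I do allow <br/> alongside proper text, like this: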

$content = '<p>first break <br/> second break <br/> the last line</p>';

Is this possible?

Or would a regular expression be better?

I tried something like this:

$nodeList = $xpath->query("//p[text()=<br\s*\/?>\s*]");
    foreach($nodeList as $node) 
    {
        $node->parentNode->removeChild($node);
    }

but it returns this error:

Warning: DOMXPath::query() [domxpath.query]: Invalid expression in...

Answer 1

You can select the unwanted p elements with XPath:

"//p[count(*)=count(br) and br and normalize-space(.)='']"

Also note that, for selecting elements with empty text in the first place, it might be better to use:

"//*[normalize-space(.)='' and not(self::br)]"

This selects any element (other than br) whose text content is empty or whitespace-only, so nodes such as:

<p><b/><i/></p>

or

<p> <br/>   <br/>
</p>

are included.
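As a minimal sketch of how the first expression might be applied, the snippet below loads a fragment, removes the matching p elements, and prints what is left. The wrapping <div>, the variable names, and the LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags are assumptions made for this illustration, not part of the answer above.

```php
<?php
// Sample input (hypothetical): one br-only paragraph and one paragraph with real text.
$content = '<div><p><br/><br/><br/><br/></p>'
         . '<p>first break <br/> second break <br/> the last line</p></div>';

$dom = new DOMDocument();
// These flags ask libxml not to add <html>/<body> wrappers or a doctype.
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
// p elements whose only child elements are br, with at least one br
// and no non-whitespace text anywhere inside.
$query = "//p[count(*)=count(br) and br and normalize-space(.)='']";

// Copy the node list into an array before removing nodes from the tree.
foreach (iterator_to_array($xpath->query($query)) as $p) {
    $p->parentNode->removeChild($p);
}

$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

Note that normalize-space() only collapses spaces, tabs, and line breaks, so a paragraph containing a non-breaking space (&nbsp;) would not count as empty with this query.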

Answer 2

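You can solve all of this by simply checking whether the only content of a paragraph is whitespace and <br/> tags. The pattern needs delimiters (here #), otherwise preg_replace() rejects it:

preg_replace("#\<p\>(\s|\<br\s*\/\>)*\<\/p\>#", "", $content);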

Breakdown:

\<p\>    # Match for <p>
(        # Beginning of a group
  \s       # Match a space character
  |        # or...
  \<br\s*\/\> # match a <br /> tag, with any number (including 0) spaces between the <br and />
)*       # Match this whole group (spaces or <br /> tags) 0 or more times.
\<\/p\>  # Match for </p>

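However, I will mention that you should not use a regular expression to parse HTML unless it is well-formatted (single line, no odd whitespace, no classes on the paragraphs, and so on). If it is, this regex should work fine.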
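A small sketch of the regex approach follows. It is a variation rather than the exact pattern from the breakdown: the slash is made optional (/?) so that <br> also matches, the group is non-capturing, and the i flag is added for case-insensitivity. The sample string and variable names are made up for the illustration.

```php
<?php
// '#' is the delimiter, so the slashes inside the pattern need no escaping.
$pattern = '#<p>(?:\s|<br\s*/?>)*</p>#i';

// Hypothetical input: a br-only paragraph followed by a paragraph with text.
$messy = '<p><br/><br /> <br></p><p>first break <br/> second break</p>';

// Paragraphs made up only of whitespace and <br> tags are dropped.
$clean = preg_replace($pattern, '', $messy);
echo $clean, PHP_EOL; // prints: <p>first break <br/> second break</p>
```

Because the group may repeat zero times, a completely empty <p></p> is removed as well. A paragraph with attributes, such as <p class="x">, is not matched, which is one of the limits mentioned above.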

Answer 3

My situation was almost the same, and I use:

$document->loadHTML(str_replace('<br>', urlencode('<br>'), $string_or_file));

and then use urldecode() to change it back before displaying it or inserting it into the database. It works for me.
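The round trip might look roughly like the sketch below. It assumes the restore step is urldecode(), and it reuses a simplified form of the empty-element query from the question; the sample string, variable names, and libxml flags are hypothetical.

```php
<?php
// Hypothetical input: text with a <br> and an empty <b> element.
$string = '<p>first<br>second<b></b></p>';

// Hide every literal <br> as plain text (%3Cbr%3E) so the DOM never sees it as an element.
$encoded = str_replace('<br>', urlencode('<br>'), $string);

$document = new DOMDocument();
$document->loadHTML($encoded, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Remove elements that have no child nodes at all; the hidden <br> is just text now.
$xpath = new DOMXPath($document);
foreach (iterator_to_array($xpath->query('//*[not(node())]')) as $node) {
    $node->parentNode->removeChild($node);
}

// Restore the <br> tags after serializing.
$html = urldecode(trim($document->saveHTML()));
echo $html, PHP_EOL;
```

Two caveats apply: only the exact spelling <br> is protected (not <br/> or <br />), and urldecode() also decodes any other percent sequences and turns + into a space, so this is only safe when the content contains neither.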

Remove <p> <br/> </p> with DOMXPath or regex?
