Question

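As per the HTML Purifier smoketest, 'malformed' URIs are occasionally discarded, leaving behind an attribute-less anchor tag, e.g.

<a href="javascript:document.location='http://www.google.com/'">XSS</a> becomes <a>XSS</a>

...and are occasionally stripped down to the protocol, e.g.

<a href="http://1113982867/">XSS</a> becomes <a href="http:/">XSS</a>

While that's unproblematic per se, it's a bit ugly. Instead of trying to strip these out with regular expressions, I was hoping to use HTML Purifier's own library capabilities / injectors / plug-ins / whathaveyou.

Point of reference: Handling attributes

Conditionally removing an attribute in HTML Purifier is easy. Here the library offers the class HTMLPurifier_AttrTransform with the method confiscateAttr().

While I don't personally use confiscateAttr(), I do use an HTMLPurifier_AttrTransform, as per this thread, to add target="_blank" to all anchors.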

// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_Target();
// purify down here

HTMLPurifier_AttrTransform_Target is a very simple class, of course.

class HTMLPurifier_AttrTransform_Target extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        // I could call $this->confiscateAttr() here to throw away an
        // undesired attribute
        $attr['target'] = '_blank';
        return $attr;
    }
}

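That part works like a charm, naturally.

Handling elements

Perhaps I'm not squinting hard enough at HTMLPurifier_TagTransform, or am looking in the wrong place, or generally don't understand it, but I can't seem to find a way to conditionally remove elements.

Say, something to the effect of this (hypothetical) code: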

// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addElementHandler('a');
$anchor->elem_transform_post[] = new HTMLPurifier_ElementTransform_Cull();
// add target as per 'point of reference' here
// purify down here

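Here the Cull class would extend something that has a confiscateElement() ability, or comparable, in which I could check for a missing href attribute or an href attribute with the content http:/.

HTMLPurifier_Filter

I understand I could create a filter, but the examples (Youtube.php and ExtractStyleBlocks.php) suggest I'd be using regular expressions there, which I'd really rather avoid if at all possible. I'm hoping for an onboard or quasi-onboard solution that makes use of HTML Purifier's excellent parsing capabilities.

Unfortunately, returning null from a child class of HTMLPurifier_AttrTransform doesn't remove the element.

Does anyone have any smart ideas, or am I stuck with regexes? :)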

Answer 1

Success! Thanks to Ambush Commander and mcgrailm in another question, I am now using a hilariously simple solution:

// a bit of context
$htmlDef = $this->configuration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');

// HTMLPurifier_AttrTransform_RemoveLoneHttp strips 'href="http:/"' from
// all anchor tags (see first post for class detail)
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_RemoveLoneHttp();

// this is the magic! We're making 'href' a required attribute (note the
// asterisk) - now HTML Purifier removes <a></a>, as well as
// <a href="http:/"></a> after HTMLPurifier_AttrTransform_RemoveLoneHttp
// is through with it!
$htmlDef->addAttribute('a', 'href*', new HTMLPurifier_AttrDef_URI());
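
The body of HTMLPurifier_AttrTransform_RemoveLoneHttp isn't shown in this thread. A minimal sketch of what such a class might look like, assuming it only needs to drop an href whose value is exactly http:/ (the autoloader path is also an assumption):

```php
<?php
// Assumption: HTML Purifier was installed via Composer; adjust the path otherwise.
require_once 'vendor/autoload.php';

/**
 * Hypothetical reconstruction, not the original class from the thread:
 * removes href="http:/" from an element's attribute array.
 */
class HTMLPurifier_AttrTransform_RemoveLoneHttp extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context)
    {
        if (isset($attr['href']) && $attr['href'] === 'http:/') {
            // confiscateAttr() unsets the key and returns its previous value
            $this->confiscateAttr($attr, 'href');
        }
        return $attr;
    }
}
```

With href declared as required, an anchor that loses its href this way should then be treated like one that never had it.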

It works, it works, bahahahaHAHAHAHA! *manic laughter, gurgling noises, keels over with a smile*

Answer 2

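The fact that you can't remove elements with a TagTransform appears to be an implementation detail. The classic mechanism for removing nodes (a smidge higher-level than just tags) is to use an Injector.

Anyway, the particular piece of functionality you're looking for is already implemented as %AutoFormat.RemoveEmpty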
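
For anyone wanting to try this, enabling the directive might look roughly like the sketch below. The autoloader path and the sample input are assumptions for illustration. The directive is documented as targeting elements with no content, so it is worth checking whether it covers anchors that still wrap text, such as <a>XSS</a>.

```php
<?php
// Assumption: HTML Purifier was installed via Composer; adjust the path otherwise.
require_once 'vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Ask HTML Purifier to remove empty elements that contribute no semantic information
$config->set('AutoFormat.RemoveEmpty', true);

$purifier = new HTMLPurifier($config);
// Illustrative input only; inspect the result for your own markup
echo $purifier->purify('<p>before <a href="javascript:alert(1)"></a> after</p>');
```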

Answer 3

For perusal, this is my current solution. It works, but bypasses HTML Purifier entirely.

/**
 * Removes <a></a> and <a href="http:/"></a> tags from the purified
 * HTML.
 * @todo solve this with an injector?
 * @param string $purified The purified HTML
 * @return string The purified HTML, sans pointless anchors.
 */
private function anchorCull($purified)
{
    if (empty($purified)) return '';
    // re-parse HTML
    $domTree = new DOMDocument();
    $domTree->loadHTML($purified);
    // find all anchors (even good ones)
    $anchors = $domTree->getElementsByTagName('a');
    // collect bad anchors (destroying them in this loop breaks the DOM)
    $destroyNodes = array();
    for ($i = 0; ($i < $anchors->length); $i++) {
        $anchor = $anchors->item($i);
        $href   = $anchor->attributes->getNamedItem('href');
        // <a></a>
        if (is_null($href)) {
            $destroyNodes[] = $anchor;
        // <a href="http:/"></a>
        } else if ($href->nodeValue == 'http:/') {
            $destroyNodes[] = $anchor;
        }
    }
    // destroy the collected nodes
    foreach ($destroyNodes as $node) {
        // preserve content: move each child in front of the anchor;
        // childNodes is a live list, so always take the current first child
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        // actually destroy the node
        $node->parentNode->removeChild($node);
    }
    // extract the contents of <body> from the serialized document
    $html = $domTree->saveHTML();
    $begin = strpos($html, '<body>') + strlen('<body>');
    $end   = strpos($html, '</body>');
    return substr($html, $begin, $end - $begin);
}

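I'd still much rather have a good HTML Purifier solution, so, just as a heads-up, this answer won't end up self-accepted. But if no better answer comes along, at least it might help those with similar issues. :)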

HTML Purifier: Removing an element conditionally based on its attributes