How do I correctly get some nodes from an HTML string?

Date: 2017-05-24 13:57:13

Tags: php domdocument

I am trying to grab some nodes from a given HTML string:

$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;

I need the first h1 node and the last img node in the string.

I use DOMDocument for this, because a regex approach such as preg_match_all could miss some cases.

Full code:

$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
HTML;

$dom = new \DOMDocument();
// since the libxml was designed for ISO-8859-1, this is a backwards hack
// @see https://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters/11310258
$dom->loadHTML(iconv('UTF-8', 'ISO-8859-1', $html),
    \LIBXML_HTML_NOIMPLIED
);
$h1List = $dom->getElementsByTagName('h1');
$h1 = $h1List->item(0);
$imgList = $dom->getElementsByTagName('img');
$img = $imgList->item($imgList->length - 1);

$data = array(
    'tabTitle' => trim($dom->saveHTML($h1)),
    'tabImg' => trim($dom->saveHTML($img))
);


// remove both wrappers if empty
$imgWrapper = $img->parentNode;
$imgWrapper->removeChild($img);

if (!$imgWrapper->hasChildNodes()) {
    $imgWrapper->parentNode->removeChild($imgWrapper);
}

$h1Wrapper = $h1->parentNode;
$h1Wrapper->removeChild($h1);

if (!$h1Wrapper->hasChildNodes()) {
    $h1Wrapper->parentNode->removeChild($h1Wrapper);
}

$data['content'] = $dom->saveHTML();

var_dump($data);

Expected output:

array(
    'tabTitle' => '<h1>Details außen</h1>',
    'tabImg' => '<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path=\'media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg\'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">',
    'content' => '
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p>
'
);

But I get the following output:

array(3) {
  'tabTitle' =>
  string(501) "<h1>Details außen<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Außenseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"></p>
</h1>"
  'tabImg' =>
  string(373) "<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="%7Bmedia%20path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'%7D" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg">"
  'content' =>
  string(108) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

"
}

What is wrong here? I am using PHP 5.6. If the problem is related to the PHP version, I can switch to PHP 7.

1 answer:

Answer 0 (score: 0)

This should get you started. First I query the DOMDocument with XPath, then I print the nodes with saveXML.

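// $html is the string from the question.
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$nodes = array();
$nodes[] = $xpath->query('//h1')->item(0);            // first <h1>
$nodes[] = $xpath->query('(//img)[last()]')->item(0); // last <img>

foreach ($nodes as $node) {
    // saveXML() returns UTF-8; utf8_decode() converts it to ISO-8859-1 for output
    echo utf8_decode($dom->saveXML($node)) . PHP_EOL;
}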

This is the output for your example:

<h1>Details außen</h1>
<img id="tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" class="tinymce-editor-image tinymce-editor-image-d52f7e72-4c4f-4cdc-86e1-5d8889bf1159" src="{media path='media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg'}" alt="07-7206-56_geschlossen_VS5458e3fd87895" width="274" height="339" data-src="media/image/07-7206-56_geschlossen_VS5458e3fd87895.jpg"/>

You can format this into the output you need.
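As a rough sketch of how the three keys from the question could be built this way: the output in the question suggests that with LIBXML_HTML_NOIMPLIED the first h1 became the root element and swallowed everything after it, so this version loads the string without that flag and reads the remaining content from the implied body instead. The helper names extractTabData() and detachNode() are made up for illustration, the img in the sample is shortened, and the input is assumed to be ASCII with entities as in the question (real UTF-8 input would still need an encoding hint). The extracted nodes are serialized with saveXML, which in the output above leaves the src attribute untouched but writes the img as a self-closing tag.

```php
<?php
// Sketch only: extractTabData() and detachNode() are hypothetical helper names.

// Remove $node from its parent; drop the parent too if that leaves it empty.
// $boundary (here: <body>) is never removed.
function detachNode(DOMNode $node, DOMNode $boundary)
{
    $parent = $node->parentNode;
    $parent->removeChild($node);
    if (!$parent->isSameNode($boundary) && !$parent->hasChildNodes()) {
        $parent->parentNode->removeChild($parent);
    }
}

function extractTabData($html)
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // No LIBXML_HTML_NOIMPLIED: libxml adds <html><body> around the fragment.
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $body  = $xpath->query('//body')->item(0);
    $h1    = $xpath->query('//h1')->item(0);            // first <h1>
    $img   = $xpath->query('(//img)[last()]')->item(0); // last <img>

    // Serialize the two nodes before they are detached from the tree.
    $data = array(
        'tabTitle' => $h1 ? $dom->saveXML($h1) : null,
        'tabImg'   => $img ? $dom->saveXML($img) : null,
    );

    if ($img) {
        detachNode($img, $body);
    }
    if ($h1) {
        detachNode($h1, $body);
    }

    // Serialize what is left inside <body>, child by child, so that the
    // doctype and the implied <html>/<body> wrappers are not part of the result.
    $content = '';
    foreach ($body->childNodes as $child) {
        $content .= $dom->saveHTML($child);
    }
    $data['content'] = trim($content);

    return $data;
}

// Usage with the string from the question (img attributes shortened here).
$html = <<<'HTML'
<h1>Details au&szlig;en</h1>
<h1>Schreibmappe DIN A4</h1>
<hr>
<p>Die Au&szlig;enseite [...]</p>
<p class="own-branding">[...]</p>
<p><img id="demo" src="{media path='media/image/demo.jpg'}" alt="demo"></p>
HTML;

$data = extractTabData($html);
var_dump($data);
```

Reading the content from the body children rather than calling saveHTML() on the whole document keeps the doctype out of the result. The paragraph that only wrapped the image is removed as well, which matches the "remove both wrappers if empty" comment in the question's code.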