How do I get both the plain text and the HTML of a DOM element created from XML?

Asked: 2015-11-01 15:00:39

Tags: php xml dom

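We have thousands of closed-caption XML files that we need to import into a database as plain text, while also preserving the HTML markup so it can be converted to another CC format. I can extract the plain text easily enough, but I can't seem to find the right way to extract the raw HTML as well.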

Is there a way to get something like "->htmlContent" that works the same way ->textContent does below?

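// Allow up to 60 seconds for the HTTP request.
$ctx = stream_context_create(array('http' => array('timeout' => 60)));
// The second argument is the boolean $use_include_path flag.
$xml = @file_get_contents('http://blah-blah-blah/16TH.xml', false, $ctx);

$dom = new DOMDocument;
$dom->loadXML($xml);
$ptags = $dom->getElementsByTagName("p");
foreach ($ptags as $p) {
    $text = $p->textContent; // plain text only; the <br> markup is lost
}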

A typical <p> being processed

<p begin="00:00:14.83" end="00:00:18.83" tts:textAlign="left">
    <metadata ccrow="12" cccol="8"/>
    (male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
</p>

Successful ->textContent result

(male narrator) THE 16TH AND 17TH CENTURIES WERE THE FORMATIVE 200 YEARS

Desired HTML result

(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS

2 Answers:

Answer 0: (score: 1)

In other words, you want to keep only specific nodes: the br elements and the text nodes. You can do this with DOM + Xpath:

$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);

foreach ($xpath->evaluate('//p') as $p) {
  $content = '';
  foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
    $content .= $document->saveHtml($node);
  }
  var_dump($content);
}

Output:

string(86) "
    (male narrator)<br> THE 16TH AND 17TH CENTURIES<br> WERE THE FORMATIVE 200 YEARS
"

Xpath expressions

Any descendant br: .//br
Any descendant text node: .//text()
Combined expression: .//br|.//text()
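
To see what the combined expression selects, here is a minimal self-contained sketch. It assumes a simplified copy of the sample <p> from the question, with the tts:textAlign attribute dropped so the fragment parses without a namespace declaration; the variable names are only illustrative.

```php
<?php
// Simplified copy of the sample <p> (assumption: no tts: attribute, no namespace).
$xml = '<p begin="00:00:14.83" end="00:00:18.83">'
     . '<metadata ccrow="12" cccol="8"/>'
     . '(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS'
     . '</p>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$p = $document->documentElement;

// textContent flattens everything to text, so the br elements disappear.
$text = $p->textContent;

// The union selects br elements and text nodes in document order;
// <metadata> matches neither branch and is therefore left out.
$content = '';
foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
    $content .= $document->saveHTML($node);
}

echo $text, "\n";
echo $content, "\n";
```

Because an XPath union returns its nodes in document order, the text and the line breaks come back interleaved exactly as they appear in the caption.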

Namespaces

If the XML uses namespaces, you have to register them and use them in the expressions.

$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('tt', 'http://www.w3.org/2006/04/ttaf1');

foreach ($xpath->evaluate('//tt:p') as $p) {
  $content = '';
  foreach ($xpath->evaluate('.//tt:br|.//text()', $p) as $node) {
    $content .= $document->saveHtml($node);
  }
  var_dump($content);
}
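
If the namespace URI varies between files, one option is to read it from the root element rather than hard-coding it. The following is only a sketch: the inline document is a made-up minimal example, and it assumes the root element carries the default namespace that the p elements inherit.

```php
<?php
// Made-up minimal caption document with a default namespace (assumption).
$xml = '<tt xmlns="http://www.w3.org/2006/04/ttaf1"><body><div>'
     . '<p begin="00:00:14.83" end="00:00:18.83">(male narrator)<br/> THE 16TH AND 17TH CENTURIES</p>'
     . '</div></body></tt>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Take the namespace URI from the root element and bind it to the "tt" prefix.
$namespaceUri = $document->documentElement->namespaceURI;
$xpath->registerNamespace('tt', $namespaceUri);

// Without a prefix, XPath only matches elements that are in no namespace.
$withoutPrefix = $xpath->evaluate('//p')->length;
$withPrefix    = $xpath->evaluate('//tt:p')->length;

var_dump($withoutPrefix, $withPrefix);
```

The prefix used in the expression does not have to match the one used in the file; only the URI it is bound to matters.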

Answer 1: (score: 0)

I couldn't see the forest for the trees until I realized that strip_tags() was failing because of the closing tag on the BR... here is a really simple solution:

foreach( $ptags as $p ) {
    $text = $p->textContent;
    $html = $p->ownerDocument->saveXML($p);         // Raw HTML
    $html = str_ireplace('<br></br>','<br>',$html); // Cleanup the BR usage
    $html = strip_tags($html,'<br>');               // Strip the tags I don't need
}

There is probably a more elegant solution using DOM or a regular expression, but this does get the job done.
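
As one possible DOM-only variant, the sketch below walks the direct children of a <p> and rebuilds the markup by hand, avoiding the string replacement. captionMarkup() is a hypothetical helper, not something from the answers above, and it assumes the br elements are direct children of <p>; text nested inside other elements would be skipped.

```php
<?php
// Hypothetical helper: keep text nodes and br elements, drop everything else.
function captionMarkup(DOMElement $p): string
{
    $html = '';
    foreach ($p->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Re-escape the text so characters such as & and < stay valid in HTML.
            $html .= htmlspecialchars($child->data, ENT_NOQUOTES);
        } elseif ($child instanceof DOMElement && $child->localName === 'br') {
            $html .= '<br>';
        }
        // Any other node (for example <metadata>) is ignored.
    }
    return trim($html);
}

// Simplified copy of the sample <p> (assumption: tts: attribute dropped).
$dom = new DOMDocument();
$dom->loadXML(
    '<p begin="00:00:14.83" end="00:00:18.83"><metadata ccrow="12" cccol="8"/>'
    . '(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS</p>'
);

$html = captionMarkup($dom->documentElement);
echo $html, "\n";
// prints: (male narrator)<br> THE 16TH AND 17TH CENTURIES<br> WERE THE FORMATIVE 200 YEARS
```

Matching on localName means the helper behaves the same whether or not the document uses a namespace.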