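We have thousands of closed-caption XML files that we must import into a database as plain text, while also preserving the HTML markup for conversion to another CC format. I can extract the plain text easily enough, but I can't seem to find the right way to extract the raw HTML as well.
Is there a way to get something like ->htmlContent that works the same way ->textContent does below?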
$ctx = stream_context_create(array('http' => array('timeout' => 60)));
$xml = @file_get_contents('http://blah-blah-blah/16TH.xml', 0, $ctx);
$dom = new DOMDocument;
$dom->loadXML($xml);
$ptags = $dom->getElementsByTagName( "p" );
foreach( $ptags as $p ) {
$text = $p->textContent;
}
A typical <p> being processed:
<p begin="00:00:14.83" end="00:00:18.83" tts:textAlign="left">
<metadata ccrow="12" cccol="8"/>
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
</p>
Successful ->textContent result:
(male narrator) THE 16TH AND 17TH CENTURIES WERE THE FORMATIVE 200 YEARS
Desired HTML result:
(male narrator)<br></br> THE 16TH AND 17TH CENTURIES<br></br> WERE THE FORMATIVE 200 YEARS
Answer 0 (score: 1)
In other words, you want to keep only specific nodes: the br elements and the text nodes. You can do this with DOM + XPath:
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);
foreach ($xpath->evaluate('//p') as $p) {
$content = '';
foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
$content .= $document->saveHtml($node);
}
var_dump($content);
}
Output:
string(86) "
(male narrator)<br> THE 16TH AND 17TH CENTURIES<br> WERE THE FORMATIVE 200 YEARS
"
Any descendant br: .//br
Any descendant text node: .//text()
Combined expression: .//br|.//text()
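To make the three expressions concrete, here is a minimal sketch that evaluates each one against a single paragraph and lists the node names the union returns. The sample string is a shortened, hypothetical stand-in for the real caption markup, and the variable names are illustrative only.

```php
<?php
// Hypothetical, shortened sample; not one of the real caption files.
$xml = '<p begin="00:00:14.83"><metadata ccrow="12"/>'
     . '(male narrator)<br></br> THE 16TH<br></br> WERE</p>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$p = $document->documentElement;

// Each expression is evaluated relative to the context node $p.
$brCount   = $xpath->evaluate('.//br', $p)->length;
$textCount = $xpath->evaluate('.//text()', $p)->length;

// The union yields both kinds of node; collect their names for inspection.
$names = [];
foreach ($xpath->evaluate('.//br|.//text()', $p) as $node) {
    $names[] = $node->nodeName; // "br" for elements, "#text" for text nodes
}

printf("br: %d, text: %d, union: %s\n", $brCount, $textCount, implode(' ', $names));
```

The metadata element never shows up in the union because it is neither a br element nor a text node, which is what drops it from the final string.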
If the XML uses namespaces, you have to register them and use the registered prefix in the expressions:
$document = new DOMDocument();
$document->preserveWhiteSpace = false;
$document->loadXml($html);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('tt', 'http://www.w3.org/2006/04/ttaf1');
foreach ($xpath->evaluate('//tt:p') as $p) {
$content = '';
foreach ($xpath->evaluate('.//tt:br|.//text()', $p) as $node) {
$content .= $document->saveHtml($node);
}
var_dump($content);
}
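If the namespace URI varies between files or is not known in advance, one possible alternative (not part of the answer above) is to match on local-name(), which ignores the namespace altogether. The sketch below assumes a tiny hypothetical document; it trades precision for convenience, since it would also match p or br elements from any other namespace.

```php
<?php
// Hypothetical sample document with a default namespace.
$xml = '<tt xmlns="http://www.w3.org/2006/04/ttaf1"><body><div>'
     . '<p begin="00:00:14.83">(male narrator)<br/> THE 16TH</p>'
     . '</div></body></tt>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

$results = [];
// local-name() compares the element name without its namespace,
// so no registerNamespace() call is needed.
foreach ($xpath->evaluate('//*[local-name()="p"]') as $p) {
    $content = '';
    foreach ($xpath->evaluate('.//*[local-name()="br"]|.//text()', $p) as $node) {
        $content .= $document->saveHTML($node);
    }
    $results[] = $content;
}

var_dump($results);
```

Registering the namespace remains the stricter choice when every file is known to use the same URI.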
Answer 1 (score: 0)
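I couldn't see the forest for the trees until I realized that strip_tags() was failing because of the closing tag after each BR tag... here is a really simple solution: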
foreach( $ptags as $p ) {
$text = $p->textContent;
$html = $p->ownerDocument->saveXML($p); // Raw HTML
$html = str_ireplace(array('<br></br>', '<br/>'), '<br>', $html); // Normalize the BR usage; saveXML() may emit empty elements as <br/>
$html = strip_tags($html,'<br>'); // Strip the tags I don't need
}
There is probably a more elegant solution using DOM or a regular expression, but this gets the job done.
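For comparison, a DOM-only variant might look like the sketch below. It walks the child nodes directly instead of serializing and then stripping tags. The helper name innerMarkup() and its $keep parameter are hypothetical, and the sample string is a shortened stand-in for a real caption paragraph.

```php
<?php
/**
 * Rebuild the content of $node, keeping text plus the element names in $keep.
 * Hypothetical helper for illustration; not part of the DOM extension.
 */
function innerMarkup(DOMNode $node, array $keep = ['br']): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Re-escape &, < and > so the result stays valid markup.
            $out .= htmlspecialchars($child->nodeValue, ENT_NOQUOTES);
        } elseif ($child instanceof DOMElement) {
            if (in_array($child->localName, $keep, true)) {
                $out .= '<' . $child->localName . '>';
            }
            // Descend either way so text nested in dropped tags is kept,
            // which mirrors what strip_tags() does.
            $out .= innerMarkup($child, $keep);
        }
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadXML('<p><metadata ccrow="12"/>(male narrator)<br></br> THE 16TH<br></br> WERE</p>');

$html = innerMarkup($dom->documentElement);
echo $html, "\n";
```

Because it never goes through a string round trip, this version does not depend on how saveXML() happens to serialize empty elements.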