Question

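To translate a website, I need to find the text between HTML tags.

My first approach was to use a regular expression, but it is not flexible enough. The closest I could get with a regex is: http://regex101.com/r/qB6xU5/1

It fails only on the last test case, where it matches the p tags in a single match instead of two.

I considered using a DOM parser library, but (after only a brief search) I could not find one that fits my needs.

Not to mention that the HTML may contain errors and Smarty template tags.

Here are some example cases that should pass, with their expected results: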

<div>test</div> => test
<div><br />test</div> => <br />test
<div>te<br />st</div> => te<br />st
<div>test<br /></div> => test<br />
<div><span>my</span>test</div> => <span>my</span>test
<div>test<span>my</span></div> => test<span>my</span>
<div>test<span>my</span>test</div> => test<span>my</span>test
<div><span>my</span>test<span>my</span></div> => <span>my</span>test<span>my</span>

Put simply, it can be restated as: find the content of every HTML tag that contains at least one string not enclosed in some other tag.

Answer 1

Don't use a regular expression. Use an HTML parser!

Here is an example with PHP Simple HTML DOM Parser, but you can do it with whichever parser you prefer:

$html = str_get_html('<div>test<br /></div>');
$div = $html->firstChild(); // Here's the div
$result = "";
// Iterate over nodes rather than children so that text nodes are included
foreach ($div->nodes as $node) {
  $result .= $node->outertext(); // .= concatenates; += would attempt arithmetic
}
echo $result; // => "test<br />"
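
The same idea can also be sketched with PHP's bundled DOM extension, in case adding a third-party parser is not an option. The function names below (`findTranslatableBlocks`, `collectBlocks`, `hasDirectText`, `innerHtml`) are made up for illustration and do not come from the question or from any library. The sketch assumes that "text not inside a tag" means an element with at least one non-blank text node as a direct child. It also treats Smarty tags as ordinary text, which may or may not be what you want.

```php
<?php
// Sketch only: findTranslatableBlocks() and its helpers are hypothetical names.
// Assumes ASCII or Latin-1 input; declare the charset yourself for UTF-8 pages.

/** Returns true when the element has a non-blank text node as a direct child. */
function hasDirectText(DOMElement $element): bool
{
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText && trim($child->data) !== '') {
            return true;
        }
    }
    return false;
}

/** Serializes the children of an element, i.e. its inner HTML. */
function innerHtml(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

/** Depth-first walk that stops descending at the first element holding loose text. */
function collectBlocks(DOMNode $node, array &$results): void
{
    foreach ($node->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // text and comments are handled through their parent
        }
        if (in_array($child->nodeName, ['script', 'style'], true)) {
            continue; // never offer scripts or styles for translation
        }
        if (hasDirectText($child)) {
            $results[] = innerHtml($child);
        } else {
            collectBlocks($child, $results);
        }
    }
}

function findTranslatableBlocks(string $fragment): array
{
    $document = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed markup
    $document->loadHTML('<html><body>' . $fragment . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    $body = $document->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        collectBlocks($body, $results);
    }
    return $results;
}

print_r(findTranslatableBlocks('<div><span>my</span>test</div><p>one</p><p>two</p>'));
```

Because DOMDocument re-serializes the markup instead of copying it, the returned strings are normalized: a `<br />` in the input should come back as `<br>`. If the translated text has to be substituted back into the original file byte for byte, that normalization is worth checking first.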

Answer 2

For the record, here is the complete code. Some of the regular expressions may not be necessary in every case, but I needed all of them ;)

<?php
include("simple_php_dom.php");

// load html content to parse
$html_str = file_get_contents("myfile.tpl");
$html = str_get_html($html_str);

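// extract strings
$results = array(); // initialize so var_dump() shows an empty array when nothing matches
parse($html, $results);
var_dump($results); // simply display the results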

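/**
 * Walk an HTML element and collect the inner content of every child
 * that contains text not wrapped in a tag of its own.
 * @param $elem    DOM element (or document) to parse
 * @param $results array that receives the extracted strings (passed by reference)
 */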
function parse($elem, &$results) {
    // walk through every child element
    foreach($elem->childNodes() as $child) {
        // get sub children
        $children = $child->childNodes();

        // get inner content
        $content = $child->innertext;

        // remove starting and ending self closing elements or smarty tags
        $content = preg_replace('/(^(\s*<[^>]*?\/\s*>)+)|((<[^>]*?\/\s*>\s*)+$)/s', '', $content);
        $content = preg_replace('/(^(\s*{[^}]*?})+)|((\{[^}]*?\}\s*)+$)/s', '', $content);
        $content = trim($content);

        // remove all elements and smarty tags
        $text = preg_replace('/<(\w+)[^>]*>.*<\s*\/\1\s*>/', '', $content); // remove elements
        $text = preg_replace('/<\/?.*?\/?>/', '', $text); // remove self closing elements
        $text = preg_replace('/\{.*?\}/', '', $text); // remove smarty tags
        $text = preg_replace('/[^\w]/', '', $text); // remove non alphanum characters
        $text = trim($text);

        // no children, we are at a leaf and it's probably a text
        if(empty($children)) {
            // check if not empty string and exclude comments styles and scripts
            if(!empty($text) && in_array($child->tag, array("comment","style","script")) === false) {
                // add to results
                $results[] = $content;
            }
        }
        // we are on a branch, but it contains text that is not inside tags
        elseif(!empty($text)) {
            // add to results
            $results[] = $content;
        } else {
            // recursive call with sub element
            parse($child, $results);
        }
    }
}
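
The text-detection step inside `parse()` can be tried on its own, without the parser library, to see how a given snippet is classified. The sketch below copies the four `preg_replace` calls into a helper named `looseText`, a hypothetical name that is not part of the answer, and returns whatever alphanumeric text survives. An empty result means `parse()` would treat the content as containing no loose text.

```php
<?php
// Hypothetical helper, not part of the original answer: returns the
// alphanumeric text left once elements and Smarty tags are stripped.
function looseText(string $content): string
{
    // Drop paired elements together with everything between them (greedy match).
    $text = preg_replace('/<(\w+)[^>]*>.*<\s*\/\1\s*>/', '', $content);
    // Drop any remaining single tags such as <br />.
    $text = preg_replace('/<\/?.*?\/?>/', '', $text);
    // Drop Smarty tags such as {$title}.
    $text = preg_replace('/\{.*?\}/', '', $text);
    // Keep only word characters, so punctuation or spaces alone do not count as text.
    $text = preg_replace('/[^\w]/', '', $text);
    return trim($text);
}

var_dump(looseText('<span>my</span>test'));                // string(4) "test"
var_dump(looseText('te<br />st'));                         // string(4) "test"
var_dump(looseText('<span>my</span>'));                    // string(0) ""
var_dump(looseText('{$title}'));                           // string(0) ""
var_dump(looseText('<span>my</span>test<span>my</span>')); // string(0) ""
```

The last call shows a behavior worth knowing about: the `.*` in the first pattern is greedy, so it runs from the first `<span>` to the last `</span>` and swallows the text sitting between two sibling elements of the same name. Switching to `.*?` would keep that text, but it can then misjudge nested elements with the same tag name, so which trade-off is acceptable depends on the templates being processed.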

PHP - text between tags
