Get an element's innerHTML, but not the element itself

Date: 2016-04-27 18:37:58

Tags: php regex domdocument

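I am pulling data from a two-column table. The first column is the variable name and the second column is the data for that variable.

I have this almost working, but some of the data may contain HTML, which is usually wrapped in a DIV. I want to get the HTML inside the DIV, but not the DIV itself. I know regex is probably a solution, but I would like to understand DOMDocument better.

This is my code so far: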

private function readHtml()
{

    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);

    $dom        = new \DOMDocument();
    $html       = $dom->loadHTML($htmlData);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

               $htmlNode = $node->getElementsByTagName('div');

                if($htmlNode->length >=1) {

                    $innerHTML= '';

                    foreach ($htmlNode as $innerNode) {

                        $innerHTML .= $innerNode->ownerDocument->saveHTML( $innerNode );
                    }

                    $value = $innerHTML;

                } else {

                    $value = $node->textContent;
                }
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

My output is correct, but I don't want to include the wrapping DIV around the data that contains HTML:

    Array
    (
        [type] => raw
        [direction] => north
        [intro] => Welcome to the test. 
        [html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
        [count] => 1003
    )

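Update

Based on some feedback and ideas from the answers, this is the current iteration of the function. It is leaner and returns the desired output. I don't feel great about the double regex, but it works.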

private function readHtml()
{

    # the url given in your example
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $dom = new \DOMDocument();
    $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

                $value = $node->ownerDocument->saveHTML( $node );

                $value = preg_replace('/(<div.*?>|<\/div>)/','',$value);
                $value = preg_replace('/(<td.*?>|<\/td>)/','',$value);
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

2 Answers:

Answer 0: (score: 1)

Use preg_replace! Like this:

$table['html_body']=preg_replace('/(<div.*?>|<\/div>)/','',$table['html_body']);

See the documentation for preg_replace, and a regular-expression reference for how the pattern works.

OR! You can use simple_html_dom like this:

<?php
include 'simple_html_dom.php'; // <--- Must download to current directory
$url = 'https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml';
$html = file_get_html( $url );
foreach ( $html->find( "div[class=softmerge-inner]" ) as $element ) {
    echo $element->innertext;
    // See http://simplehtmldom.sourceforge.net/manual.htm for usage
}
?>
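To see what the preg_replace pattern above does in isolation, here is a minimal sketch run against a made-up cell string (the `$cell` value is an assumption modelled on the html_body output in the question, not data fetched from the sheet). Note that the pattern removes every opening and closing DIV tag it finds, so it would also flatten any nested DIVs.

```php
<?php
// Made-up sample imitating the html_body value from the question.
$cell = '<div class="softmerge-inner" style="width: 5653px; left: -1px;">'
      . 'Lorem <span style="font-weight:bold;">ipsum</span></div>';

// Lazily match each opening <div ...> tag, or a closing </div>, and drop it.
$stripped = preg_replace('/(<div.*?>|<\/div>)/', '', $cell);

echo $stripped, "\n";
```

The inner span survives because the pattern only targets tags that start with `<div` or equal `</div>`.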

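Answer 1: (score: 1)

You are on the right track! The next level is to learn xpath statements, which are very powerful with parsers like the one DomDocument offers. Consider the following code example: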

<?php
# the url given in your example    
$url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

$doc = new \DOMDocument();
$doc->loadHTMLFile($url);

$xpath = new \DOMXpath($doc);

# here comes the magic
$html_body = $xpath->query("//td[text()='html_body']")->item(0);
$div_text = $html_body->nextSibling->textContent;
echo $div_text;
?>

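The clue is to query the DOM for the column whose text node equals html_body, which is done via //td[here comes the expression to filter on all columns in the dom]. After that, all you need is the next sibling. With this in mind, you could even rewrite your whole function with a foreach over all rows in the waffle table: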

foreach($xpath->query("//table[@class='waffle']//tr") as $row) {
    // do sth. useful here
}

For your specific example, this could be (it is somewhat shorter, isn't it?):

<?php
$url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";
$doc = new \DOMDocument();
$doc->loadHTMLFile($url);

$xpath = new \DOMXpath($doc);

foreach ($xpath->query("//table[@class='waffle']//tr") as $row) {
    $columns = $xpath->query("./td", $row);

    $key_td = $columns->item(0);
    $value_td = $columns->item(1);
    echo "[" . $key_td->nodeValue . "]: " . $value_td->nodeValue . "\n";
}

?>
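
Note that textContent and nodeValue return plain text, so the span tags inside the DIV are lost. If the inner markup should be kept, one possible sketch is to serialize each child node of the DIV with saveHTML and concatenate the results. The helper name `innerHtml` and the `$sample` markup below are assumptions made for illustration; they are not part of DOMDocument or of the published sheet.

```php
<?php
// Hypothetical helper (not a DOMDocument method): returns the markup
// inside $element without the element's own opening and closing tags.
function innerHtml(\DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes only that node and its descendants.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Made-up sample markup imitating one row of the waffle table.
$sample = '<table class="waffle"><tr><td>html_body</td>'
        . '<td><div class="softmerge-inner">Lorem '
        . '<span style="font-weight:bold;">ipsum</span> dolor</div></td>'
        . '</tr></table>';

$doc = new \DOMDocument();
$doc->loadHTML($sample);

$xpath = new \DOMXPath($doc);

// Find the key cell, step to the value cell next to it, then into its DIV.
$div = $xpath
    ->query("//td[text()='html_body']/following-sibling::td[1]/div")
    ->item(0);

$inner = innerHtml($div);
echo $inner, "\n";
?>
```

This avoids the regular expressions entirely: the wrapper DIV is never serialized, so there is nothing to strip afterwards.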