Simple HTML Parser question

Time: 2013-05-11 01:18:20

Tags: php html parsing html-parsing simple-html-dom

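Hello, I am trying to parse professor names and comments from the ratemyprofessor website and convert each div to plain text. This is the div class structure I am working with.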

<div id="ratingTable">
<div class="ratingTableHeader"></div>
<div class="entry odd"><a name="18947089"></a>
<div class="date">
  8/24/11  <!-- the date which I want to parse -->
</div><div class="class"><p>
  ENGL2323 <!-- the class which I want to parse -->
</p></div><div class="rating"></div><div class="comment" style="width:350px;">
  <!-- comment section -->
<p class="commentText">    <!-- this is what I want to parse as plaintext for each entry -->
  I have had Altimont for 4 classes. He is absolutely one of my favorite professors at St. Ed's. He's generous with his time, extremely knowledgeable, and such an all around great guy to know. Having class with him he would always have insightful comments on what we were reading, and he speaks with a lot of passion about literature. Just the best!
</p><div class="flagsIcons"></div></div>
  <!-- closes comment -->
</div>
  <!-- closes even or odd -->
<div class="entry even"></div> <!-- these divs are the entries for each professor -->
  <!-- closes even or odd -->
<div class="entry odd"></div>
  <!-- closes even or odd -->
</div>
  <!-- closes rating table -->

So every entry is wrapped inside this "ratingTable" div, and each entry is either an "entry odd" or an "entry even" div.

Here is my attempt so far, but it only produces one huge garbled array full of junk.

<?php
header('Content-type: text/html; charset=utf-8'); // this just makes sure encoding is right
include('simple_html_dom.php'); // the parser library

$html = file_get_html('http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=834'); // the url for the teacher rating profile

//first attempt, rendered nothing though 

  foreach($html->find("div[class=commentText]") as $content){
      echo $content.'<hr />';
  }

 foreach($html->find("div[class=commentText]") as $content){
  // $content is the <div class="commentText">; first_child() should be the <p>
  echo $content->first_child().'<hr />';

 //Get the <p>'s following the <div class="commentText">

     $next = $content->next_sibling();
    while ($next->tag == 'p') {
        echo $next.'<hr />';
        $next = $next->next_sibling();
    }
}
?>

2 answers:

Answer 0 (score: 0)

Messy HTML... can you try this and see if it works?

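// $html must be the raw page markup as a string (e.g. from file_get_contents()),
// not a simple_html_dom object. In the markup shown in the question,
// commentText is a <p> nested inside each div.entry, so the XPath goes entry -> p.
foreach (DOM($html, '//div[contains(@class,"entry")]//p[@class="commentText"]') as $comment)
{
    echo trim(strval($comment)), "\n";
}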

Oh, and I don't like simple_html_dom; use this instead:

function DOM($html, $xpath = null, $key = null, $default = false)
{
    if (is_string($html) === true)
    {
        $dom = new \DOMDocument();

        if (libxml_use_internal_errors(true) === true)
        {
            libxml_clear_errors();
        }

        if (@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8')) === true)
        {
            return DOM(simplexml_import_dom($dom), $xpath, $key, $default);
        }
    }

    else if (is_object($html) === true)
    {
        if (isset($xpath) === true)
        {
            $html = $html->xpath($xpath);
        }

        if (isset($key) === true)
        {
            if (is_array($key) !== true)
            {
                $key = explode('.', $key);
            }

            foreach ((array) $key as $value)
            {
                $html = (is_object($html) === true) ? get_object_vars($html) : $html;

                if ((is_array($html) !== true) || (array_key_exists($value, $html) !== true))
                {
                    return $default;
                }

                $html = $html[$value];
            }
        }

        return $html;
    }

    return false;
}
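
For comparison, here is a minimal sketch of the same XPath idea using PHP's built-in DOMDocument and DOMXPath directly, pulling the date, class and comment of each entry as plain text. It runs on an inline sample trimmed from the markup in the question; the second entry, the helper name `extractEntries` and the whitespace handling are assumptions made for illustration, and the live page may well be structured differently.

```php
<?php
// Inline sample based on the question's markup. The second entry is invented
// purely to show the odd/even case; it is not real data.
$sample = <<<'HTML'
<div id="ratingTable">
<div class="ratingTableHeader"></div>
<div class="entry odd">
<div class="date"> 8/24/11 </div>
<div class="class"><p> ENGL2323 </p></div>
<div class="comment"><p class="commentText"> Just the best! </p></div>
</div>
<div class="entry even">
<div class="date"> 5/2/11 </div>
<div class="class"><p> ENGL1301 </p></div>
<div class="comment"><p class="commentText"> Great   class. </p></div>
</div>
</div>
HTML;

// Hypothetical helper: returns one associative array per rating entry.
function extractEntries(string $markup): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy real-world HTML
    $dom->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Collapse runs of whitespace so each field becomes one plain-text string.
    $text = function (?DOMNode $node): string {
        return $node === null ? '' : trim(preg_replace('/\s+/', ' ', $node->textContent));
    };

    // Match both "entry odd" and "entry even" via the class token "entry".
    $query = '//div[@id="ratingTable"]'
           . '/div[contains(concat(" ", normalize-space(@class), " "), " entry ")]';

    $entries = [];
    foreach ($xpath->query($query) as $entry) {
        $entries[] = [
            'date'    => $text($xpath->query('.//div[@class="date"]', $entry)->item(0)),
            'class'   => $text($xpath->query('.//div[@class="class"]', $entry)->item(0)),
            'comment' => $text($xpath->query('.//p[@class="commentText"]', $entry)->item(0)),
        ];
    }
    return $entries;
}

$entries = extractEntries($sample);
foreach ($entries as $e) {
    echo $e['date'], ' | ', $e['class'], ' | ', $e['comment'], PHP_EOL;
}
```

Querying relative to each entry (the leading `.` in `.//div[...]`) keeps the date, class and comment of one rating together, instead of producing three unrelated flat lists.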

Answer 1 (score: 0)

If you still want to use simple_html_dom, see the comments in the following code for what was wrong in yours:

<?php
header('Content-type: text/html; charset=utf-8'); // this just makes sure encoding is right
include('simple_html_dom.php'); // the parser library

// You were parsing the wrong link: your previous URL has no element with the commentText class.
// I chose a random professor's page. Pick whichever professor you like, or grab the professor
// links from the previous page, store them in an array and loop through them to get the comments.

$html = file_get_html('http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1398302'); // the url for the teacher rating profile

//first attempt, rendered nothing though 

 //your div tag has class "comment" not "commentText"
  foreach($html->find("div[class=comment]") as $content){
      echo $content.'<hr />';
  }


 foreach($html->find("div[class=comment]") as $content){

 // I am not sure what you are trying to do here, but whatever it is, it's wrong
 //$content = <div class='commentText'>";  //  first_child() should be the <p>
 //echo $content->first_child().'<hr />';

  // use the node from the current iteration ($content), not the whole document ($html)
  echo $content->first_child().'<hr />';


//this whole code does not make any sense since you are already retrieving the comments from the above code.. but if you still want to use it .. I can figure out what to do

 //Get the <p>'s following the <div class="commentText">
//     $next = $html->firstChild()->next_sibling();
//    while ($next->tag == 'p') {
//        echo $next.'<hr />';
//        $next = $next->next_sibling();
//     }
    }
?>
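
The comment in the code above suggests grabbing the professor links from the listing page and looping over them. A rough sketch of that step, using PHP's built-in DOMDocument rather than simple_html_dom so that it runs without extra files, might look like the following. The listing markup, the `profName` class, the names and the helper `collectProfessorUrls` are all hypothetical; only the `ShowRatings.jsp?tid=` link shape is taken from the URL used in this answer.

```php
<?php
// Hypothetical listing markup: the real page structure is not shown in this thread.
$listing = <<<'HTML'
<div class="profName"><a href="ShowRatings.jsp?tid=1398302">Alexander</a></div>
<div class="profName"><a href="ShowRatings.jsp?tid=1398302">Alexander (repeated link)</a></div>
<div class="profName"><a href="ShowRatings.jsp?tid=42">Someone Else</a></div>
<a href="/about.jsp">About</a>
HTML;

// Hypothetical helper: returns the unique absolute rating-page URLs found in $markup.
function collectProfessorUrls(string $markup, string $baseUrl): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy real-world HTML
    $dom->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $urls = [];
    foreach ($xpath->query('//a[contains(@href, "ShowRatings.jsp?tid=")]') as $a) {
        // Using the URL as an array key drops duplicates while keeping first-seen order.
        $urls[$baseUrl . ltrim($a->getAttribute('href'), '/')] = true;
    }
    return array_keys($urls);
}

$urls = collectProfessorUrls($listing, 'http://www.ratemyprofessors.com/');
foreach ($urls as $url) {
    echo $url, PHP_EOL;
}
```

Each collected URL could then be passed to `file_get_html()` and run through the same `foreach` over the comment divs shown above.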

Output

Comment

Dr.Alexander was the best. I would recommend him for American Experience or any class he teaches really. He is an amazing professor and one of the nicest most kind hearted people i've ever met.
Report this rating

Professor Alexander is SO great. I would recommend him to everyone for american experience. He has a huge heart and he's really interested in getting to know his students as actual people. The class isn't difficult and is super interesting. He's amazing.
Report this rating

DINS