Extracting a substring from web page content

Date: 2013-11-27 07:01:05

Tags: php substr

I am using file_get_contents to fetch a web page and parse its data. Now I want to take the first 150 characters as a description of that URL.

$url = 'http://crewow.com/CSS_Layout_Tutorial.php';
$data = file_get_contents($url);
$content = plaintext($data);
$Preview = trim_display(140, $content); // show the first characters of the web page as a preview
echo $Preview;

function trim_display($size, $string)
{
    echo "string is  : $string <br/>";

    $trim_string = substr($string, 0, 150);

    $trim_string = $trim_string . "...";
    echo "Trim string is  $trim_string <br/>";
    return $trim_string;
}

function plaintext($html)
{
    // remove the title element
    $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);

    //$plaintext = preg_match('#<title>(.*?)</title>#', $html);
    // remove comments and any content found in the comment area (strip_tags only removes the actual tags)
    $plaintext = preg_replace('#<!--.*?-->#s', '', $plaintext);

    // put a space between list items (strip_tags just removes the tags)
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);

    // remove all script and style elements, including their content
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);
    return $plaintext;
}

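This code works for some URLs. For a few others, nothing is displayed in $Preview. The data reaches trim_display() correctly, but it does not survive $trim_string = substr($string, 0, 150);

The output of that line remains empty.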

1 answer:

Answer 0 (score: 2)

Your code is actually correct and works as written. Unfortunately, the first 150 characters it returns contain nothing visible. Try 5000.

$trim_string = substr($string, 0, 5000);

To understand the problem, look at the output with View Source.
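A minimal sketch of what is probably happening, assuming the page's markup begins with many tags separated by newlines and indentation. The $html string is a hypothetical stand-in for the downloaded page, not the real content of that URL.

```php
<?php
// Hypothetical stand-in for a downloaded page: 40 indented <meta> lines
// in the head, then one paragraph of visible text.
$html = "<html>\n<head>\n"
      . str_repeat("    <meta name=\"x\" content=\"y\">\n", 40)
      . "</head>\n<body>\n<p>Hello, this is the visible text.</p>\n</body>\n</html>";

// strip_tags() removes the tags but keeps the whitespace between them.
$text = strip_tags($html);

// Without trim(), the first 150 bytes are only spaces and newlines.
$untrimmed = substr($text, 0, 150);

// With trim(), the substring starts at the first visible character.
$trimmed = substr(trim($text), 0, 150);

var_dump(trim($untrimmed) === ''); // true: nothing visible in the untrimmed preview
var_dump($trimmed);                // the visible paragraph text
```

A browser collapses that whitespace when it renders the page, which is why the preview looks empty until you view the source.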

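You can use this code in place of yours. It trims the content before taking the substring, which should fix the empty preview: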

$url = 'http://crewow.com/CSS_Layout_Tutorial.php';
$data = file_get_contents($url);
$content = plaintext($data);
// trim first so the preview does not start with leading whitespace
$Preview = trim_display(150, trim($content)); // show the first 150 characters of the web page as a preview
echo $Preview;

function trim_display($size, $string)
{
    // use $size rather than a hard-coded length
    $trim_string = substr($string, 0, $size);
    $trim_string = $trim_string . "...";
    return $trim_string;
}

function plaintext($html)
{
    // remove the title element
    $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);

    //$plaintext = preg_match('#<title>(.*?)</title>#', $html);
    // remove comments and any content found in the comment area (strip_tags only removes the actual tags)
    $plaintext = preg_replace('#<!--.*?-->#s', '', $plaintext);

    // put a space between list items (strip_tags just removes the tags)
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);

    // remove all script and style elements, including their content
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);
    return $plaintext;
}
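
If the page contains whitespace between words or non-ASCII text, a slightly more defensive preview helper may be worth considering. The sketch below is not part of the answer above: make_preview is a hypothetical name, and it assumes the mbstring extension is available and that the input is UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer).
// Assumes the mbstring extension is loaded and $text is UTF-8.
function make_preview(string $text, int $size = 150): string
{
    // Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);

    // Already short enough: return it without an ellipsis.
    if (mb_strlen($text, 'UTF-8') <= $size) {
        return $text;
    }

    // mb_substr() counts characters, so a multi-byte character is never cut in half.
    return rtrim(mb_substr($text, 0, $size, 'UTF-8')) . '...';
}

echo make_preview("\n\n   CSS   Layout\n\tTutorial  ", 150), "\n";
echo make_preview(str_repeat('a', 200), 10), "\n";
```

It could be called as make_preview(plaintext($data), 150) in place of trim_display(150, trim($content)). Unlike trim_display, it appends the ellipsis only when the text was actually shortened.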