Unable to fetch web page content correctly

Asked: 2015-02-02 10:52:10

Tags: php

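I am trying to fetch a web page's content in order to extract RSS links. I wrote the code below. It does fetch the page, but part of the content I need is missing from the result!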

<?php
function getUrl($url)
{
    $ch = curl_init(); 
    $timeout = 5; // set to zero for no timeout 
    curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout); 
    curl_setopt ($ch, CURLOPT_URL, $url); 
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1); 
    $file_contents = curl_exec($ch); // run the request and capture the response body
    curl_close($ch); 
    return $file_contents;
}

echo getUrl("http://www.journaltocs.ac.uk/index.php?action=browse&subAction=pub&publisherID=10&local_page=1&sorType=DESC&sorCol=2&pageb=1");

?>

This is what I need from the URL above; it contains a link with title="Journal TOC RSS feeds":

<p style="text-align:left;">Publisher: <b><a href="http://www.law.ed.ac.uk/ahrc" target="_blank"><b>AHRC Research Centre</b></a> <a href="http://www.law.ed.ac.uk/ahrc" title="Publisher Homepage" target="_blank"><img src="images/link_external.png" border="0" style="vertical-align:middle;margin:0;"></a> &nbsp; </b> (Total: 1 journals)</p><table style="width:100%"><tr valign="top"><td style="width:25px;"><input type="checkbox" class="nobox" id="search_result_journal_19827xxx19039" name="journal[]" onclick="process_journal_tick(this, 'my_tocs');" value="19827xxx19039"  /></td><td><a href="index.php?action=browse&subAction=pub&publisherID=10&journalID=19827&pageb=1&userQueryID=&sort=&local_page=1&sorType=DESC&sorCol=2">SCRIPTed - A J. of Law, Technology & Society</a> &nbsp; &nbsp; <a href="http://www.law.ed.ac.uk/ahrc/script-ed/index.asp" title="Journal Homepage" target="_blank"><img src="images/layout_elements/triangle.png" border="0" style="vertical-align:middle;margin:0;"></a> <a href="http://feeds.feedburner.com/Script-ed?format=xml" title="Journal TOC RSS feeds" target="_blank"><img src="images/icon_feed.jpg" border="0" style="vertical-align:middle;margin:0;"></a> <img src="images/icon_oa.jpg" border="0" style="vertical-align:middle;margin:0;" title="Open Access" alt="Open Access">  &nbsp; <span style="color:#A8A8A8;">(<span style="color:#808080;">Followers:</span> 7)</span> </td>
</tr></table>

But what I get from my code is:

<p style="text-align:left;">Publisher: <b><a href="http://www.law.ed.ac.uk/ahrc" target="_blank"><b>AHRC Research Centre</b></a> <a href="http://www.law.ed.ac.uk/ahrc" title="Publisher Homepage" target="_blank"><img src="images/link_external.png" border="0" style="vertical-align:middle;margin:0;"></a> &nbsp; </b> (Total: 1 journals)</p><table style="width:100%"><tr valign="top"><td style="width:25px;"><input type="checkbox" class="nobox" id="search_result_journal_19827xxx0" name="journal[]" onclick="process_journal_tick(this, 'my_tocs');" value="19827xxx0"  /></td><td><a href="index.php?action=browse&subAction=pub&publisherID=10&journalID=19827&pageb=1&userQueryID=&sort=&local_page=1&sorType=DESC&sorCol=2">SCRIPTed - A J. of Law, Technology & Society</a> &nbsp; &nbsp; <a href="http://www.law.ed.ac.uk/ahrc/script-ed/index.asp" title="Journal Homepage" target="_blank"><img src="images/layout_elements/triangle.png" border="0" style="vertical-align:middle;margin:0;"></a>  <img src="images/icon_oa.jpg" border="0" style="vertical-align:middle;margin:0;" title="Open Access" alt="Open Access">  &nbsp; <span style="color:#A8A8A8;">(<span style="color:#808080;">Followers:</span> 7)</span> </td>
</tr></table> 

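As you can see, the link with the title "Journal TOC RSS feeds" has been removed!

I also tried file_get_contents($url), but it did not help. Can you help me solve this? I don't know what the problem is.

Thanks in advance.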

1 answer:

Answer 0: (score: 1)

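// $cookies defaults to false so the login call can omit it (cookies are then written to cookies.txt)
function SendCurl($url, $post, $post_data, $user_agent, $cookies = false){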

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);

    if($user_agent)
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    if($post) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
    }

    if($cookies) {
        curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
    } else {
        curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    }

    $response = curl_exec($ch);
    $http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return array($http, $response);
}

$login_email = "your email";
$login_password = "your password";
$login_url = "http://www.journaltocs.ac.uk/?action=login";
$user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5'; //optional
$login_data = array(
    'f_user'=>$login_email,
    'f_pass'=>$login_password
);

$webpage_url = "http://www.journaltocs.ac.uk/index.php?action=browse&subAction=pub&publisherID=10&local_page=1&sorType=DESC&sorCol=2&pageb=1";

try{
    //login first and save cookies
    $response = SendCurl($login_url,true,$login_data,$user_agent);
    //if login failed
    if( strpos($response[1],"Username or Password is incorrect") !== false )
        throw new Exception("Username or Password is incorrect");

    //start fetch webpage
    $response = SendCurl($webpage_url,false,false,$user_agent,"cookies.txt");
    if( strpos($response[1],"Journal TOC RSS feeds") !== false )
        die("Journal TOC RSS feeds button is found");

}catch(Exception $e){
    die($e->getMessage());
}

The RSS feed icon is only shown when you are logged in.

So you need to log in before fetching the page content.
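Once the logged-in page has been fetched, the feed URLs still have to be pulled out of the HTML. Below is a minimal sketch of one way to do that with PHP's DOM extension. The helper name `extractLinksByTitle` and the shortened `$sample` markup are made up for illustration (the sample is cut down from the snippet in the question), and the sketch assumes the feed links keep the title "Journal TOC RSS feeds".

```php
<?php
// Hypothetical helper: return the href of every <a> whose title attribute
// equals $title exactly. Assumes $html is a non-empty string.
function extractLinksByTitle(string $html, string $title): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid markup, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Select every anchor that has a title, then compare in PHP so the
    // title never has to be escaped inside the XPath expression.
    foreach ($xpath->query('//a[@title]') as $anchor) {
        if ($anchor->getAttribute('title') === $title) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Shortened sample of the logged-in markup shown in the question.
$sample = '<table><tr><td>'
    . '<a href="http://www.law.ed.ac.uk/ahrc/script-ed/index.asp" title="Journal Homepage">home</a> '
    . '<a href="http://feeds.feedburner.com/Script-ed?format=xml" title="Journal TOC RSS feeds">feed</a>'
    . '</td></tr></table>';

print_r(extractLinksByTitle($sample, 'Journal TOC RSS feeds'));
```

With the answer's code, the same helper could be called on the body returned by the second request, for example `extractLinksByTitle($response[1], 'Journal TOC RSS feeds')`, after checking that the login succeeded.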