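Log in to a remote site with PHP cURL and fetch site data

Date: 2017-08-23 10:13:28

Tags: php curl web-scraping

I am trying to fetch data from https://www.manheim.com as a logged-in user. I have already implemented the same thing for cPanel, and it works on every cPanel host I have tried, but it does not work for this site. Please help me do the same here. I get the authenticity_token by inspecting the login page. Once the login succeeds, I will scrape it automatically.

cPanel code: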

$url="http://example.com:2082/login/?login_only=1"; 
$pass = 'pass';

$postinfo = "user=user";
$postinfo .= "&pass=".$pass;
$postinfo .= "&submit=Login";

$cookie_file_path = $path."/cookie.txt";

$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);

curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
//set the cookie the site has for certain features, this is optional
curl_setopt($ch, CURLOPT_COOKIE, "cookiename=0");
curl_setopt($ch, CURLOPT_USERAGENT,
    "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);
$login = curl_exec($ch);

$login = json_decode($login);

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
//page with the content I want to grab
curl_setopt($ch, CURLOPT_URL, "http://example.com:2082". $login->redirect);
//do stuff with the info with DomDocument() etc
$html = curl_exec($ch);
curl_close($ch);

print_r($html);

Code for manheim.com:

$url ="https://www.manheim.com/login/authenticate"; 
$ch = curl_init();      
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  
curl_setopt($ch, CURLOPT_URL, $url); 
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY); 
curl_setopt($ch, CURLOPT_COOKIE, "cookiename=0");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36");
curl_setopt ($ch, CURLOPT_COOKIEJAR, $path . "/cookie.txt"); 
curl_setopt ($ch, CURLOPT_POSTFIELDS, "utf8=✓&authenticity_token=vsB5lCaB0rumkZxm940HWgMxSecpsjDXMGJxYHDbU5g=&user[username]=user&user[password]=pass&submit=Login"); 
ob_start();      
curl_exec ($ch); 
ob_end_clean();  
curl_close ($ch); 
unset($ch); 

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); 
curl_setopt($ch, CURLOPT_COOKIEFILE, $path . "/cookie.txt"); 
curl_setopt($ch, CURLOPT_URL, "https://www.manheim.com/members/powersearch/keywordSearchResults.do?searchTerms=WA1CNAFY1J2000316"); 
$result = curl_exec ($ch); 

curl_close ($ch); 
echo $result; 

1 answer:

Answer 0 (score: 0)

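In your cPanel code, the username and password are not urlencoded, so if they contain & or = or spaces or tabs or ÆØÅ or certain other characters, you'll send the wrong credentials and won't get logged in. Your cPanel login code only ever worked because you were lucky and your password didn't contain any special characters. Fix that; the function you're looking for is urlencode(). Also, don't use CURLOPT_CUSTOMREQUEST for GET requests (use CURLOPT_HTTPGET instead, although that is the default mode anyway), nor for POST requests (use CURLOPT_POST instead).

As for manheim.com, it seems you're making the same urlencode mistake as with cPanel. More importantly, your authenticity_token is hardcoded. That will never work: the authenticity_token is tied to your browser, unique to your browser's session cookie id, and probably expired a long time ago.

Instead, make a GET request to https://www.manheim.com/. In the response headers you'll get your unique session cookie id. You must send this cookie on all further requests, because not sending it is like switching browsers. This site is also a bit odd: you can't view the login page unless you already have a session cookie id, and if you try, you just get an HTTP header redirect to a bogus URL. Next, make a GET request, with the cookie, to https://www.manheim.com/login/. The HTML you get back contains the unique authenticity_token tied to your cookie session id. Parse the authenticity_token out of the HTML, then add it, together with the static data (username, password and submit), to the POST body of the next request, to https://www.manheim.com/login/authenticate, and make sure everything is urlencoded. Then you should be logged in.

To keep fetching logged-in pages, keep sending the same session cookie id, because your login is tied to that id. If you don't send it, the web server won't remember that you've logged in, and you'll get some "you need to log in to see this page" error.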
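To make the urlencode point concrete, here is a minimal sketch that builds the cPanel-style POST body both ways and decodes each one with parse_str(), which is only a rough stand-in for how a server reads a form body. The credentials are made up for illustration.

```php
<?php
// Hypothetical credentials, picked because they contain '&', '=' and a space.
$user = 'user';
$pass = 'p&ss=w rd';

// What the question's cPanel code effectively sends: values concatenated as-is.
$raw = 'user=' . $user . '&pass=' . $pass . '&submit=Login';

// The same body with every value passed through urlencode().
$encoded = 'user=' . urlencode($user)
    . '&pass=' . urlencode($pass)
    . '&submit=Login';

// Decode both bodies the way a form parser would.
parse_str($raw, $seenRaw);
parse_str($encoded, $seenEncoded);

var_dump($seenRaw['pass']);               // string(1) "p" : cut off at the '&'
var_dump($seenEncoded['pass'] === $pass); // bool(true)
```

In the unencoded body the '&' inside the password ends the pass field early, so the server would see the password "p" plus a stray "ss" field.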

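Here is a sample implementation (using hhb_curl from https://github.com/divinity76/hhb_.inc.php/blob/master/hhb_.inc.php as a convenience wrapper around the curl_ functions; it turns silent errors into exceptions, handles cookies, and so on):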

<?php
declare(strict_types = 1);
require_once ('hhb_.inc.php');
hhb_init(); // better error reporting
const USERNAME = '?';
const PASSWORD = '??';
$hc = new hhb_curl ( '', true );

header ( "content-type: text/plain;charset=utf-8" );
$hc->exec ( 'https://www.manheim.com/' ); // getting a cookie session id
$html = $hc->exec ( 'https://www.manheim.com/login/' )->getResponseBody (); // getting authenticity_token , required for logging in
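$domd = new DOMDocument (); // loadHTML() cannot be called statically on PHP 8
@$domd->loadHTML ( $html ); // @ hides warnings about sloppy real-world markup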
$xp = new DOMXPath ( $domd );
$token = $xp->query ( '//input[@name="authenticity_token"]' )->item ( 0 )->getAttribute ( "value" );
$hc->setopt_array ( array (
        CURLOPT_URL => 'https://www.manheim.com/login/authenticate',
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => http_build_query ( array (
                'utf8' => '✓',
                'authenticity_token' => $token,
                'user' => array (
                        'username' => USERNAME,
                        'password' => PASSWORD 
                ),
                'submit' => 'Login' 
        ) ) 
) )->exec ();
$html = $hc->getResponseBody ();
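$domd = new DOMDocument (); // loadHTML() cannot be called statically on PHP 8
@$domd->loadHTML ( $html ); // @ hides warnings about sloppy real-world markup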
$xp = new DOMXPath ( $domd );
$errmsg = $xp->query ( '//*[contains(@class,"msgError")]' );
if ($errmsg->length > 0) {
    echo 'Error logging in: ' . $errmsg->item ( 0 )->textContent;
} else {
    echo 'logged in!';
}
hhb_var_dump ( $token, $hc->getStdErr (), $hc->getStdOut () );

As posted, it prints: Error logging in: The username and password you entered do not match any accounts on record. Please make sure that you have correctly entered your username and password. Both are case-sensitive and should not contain any special characters. If you have forgotten your username or password, please use our username retrieval or password retrieval tools.

But if you change the username and password (the USERNAME and PASSWORD constants near the top of the script), it will probably log in instead and then say logged in! (and then print a bunch of debug data).
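If you would rather not pull in the wrapper, the same flow can be sketched with the plain curl_ functions. This is only a sketch under assumptions: the function names (extract_authenticity_token, fetch, manheim_login) are made up for illustration, CURLOPT_FOLLOWLOCATION is assumed to be wanted, and the site's behaviour is taken from the description above, not re-verified.

```php
<?php
// Pull the authenticity_token value out of the login page HTML; null if absent.
function extract_authenticity_token(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // '@' hides warnings about sloppy real-world markup
    $xp = new DOMXPath($dom);
    $input = $xp->query('//input[@name="authenticity_token"]')->item(0);
    return $input instanceof DOMElement ? $input->getAttribute('value') : null;
}

// Run one request on the shared handle; GET when $post is null, POST otherwise.
function fetch($ch, string $url, ?array $post = null): string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($post === null) {
        curl_setopt($ch, CURLOPT_HTTPGET, true);
    } else {
        curl_setopt($ch, CURLOPT_POST, true);
        // http_build_query() urlencodes every key and value.
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post, '', '&'));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Returns a handle that carries the session cookies of the attempted login.
function manheim_login(string $username, string $password)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // assumption: follow redirects
        CURLOPT_COOKIEFILE     => '',   // empty string turns on the in-memory cookie engine
    ]);
    fetch($ch, 'https://www.manheim.com/'); // picks up the session cookie
    $token = extract_authenticity_token(fetch($ch, 'https://www.manheim.com/login/'));
    if ($token === null) {
        throw new RuntimeException('authenticity_token not found on login page');
    }
    fetch($ch, 'https://www.manheim.com/login/authenticate', [
        'utf8'               => '✓',
        'authenticity_token' => $token,
        'user'               => ['username' => $username, 'password' => $password],
        'submit'             => 'Login',
    ]);
    return $ch;
}
```

Reusing the returned handle for later requests, for example fetch($ch, $someMembersUrl), keeps sending the same session cookie, which is what ties those requests to the login. The sketch does not check whether the login was accepted; the msgError check from the sample above would still be needed for that.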

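  • P.S. In the code above I used http_build_query rather than urlencode() to do the URL encoding. When there are several variables to encode, passing one array to a single http_build_query call usually gives prettier code than manually urlencode()ing everything :)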
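For comparison, a small sketch that builds the same login body by hand with urlencode() and with one http_build_query() call. The values are made up, and the separator is passed explicitly so the result does not depend on the arg_separator.output ini setting.

```php
<?php
// Hypothetical values containing characters that need encoding.
$username = 'us er';
$password = 'p&ss';
$token    = 'abc+/=';

// Manual version: every key and value goes through urlencode() individually.
$manual = 'utf8=' . urlencode('✓')
    . '&authenticity_token=' . urlencode($token)
    . '&' . urlencode('user[username]') . '=' . urlencode($username)
    . '&' . urlencode('user[password]') . '=' . urlencode($password)
    . '&submit=' . urlencode('Login');

// http_build_query() version: one call, nested array for the user[...] fields.
$built = http_build_query([
    'utf8'               => '✓',
    'authenticity_token' => $token,
    'user'               => ['username' => $username, 'password' => $password],
    'submit'             => 'Login',
], '', '&');

// Compare the two bodies byte for byte.
var_dump($manual === $built);
```

Both produce the same body; the array version is shorter and leaves no value that can be forgotten.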