Goutte scrape: logging in to an HTTPS-secured website

Time: 2015-03-17 05:53:55

Tags: symfony ssl curl web-scraping goutte

I'm trying to use Goutte to log in to an https website, but I get the following error:

cURL error 60: SSL certificate problem: unable to get local issuer certificate
500 Internal Server Error - RequestException
1 linked Exception: RingException

This is the code given by Goutte's creator:

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', 'http://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

Or this is the code Symfony recommends:

use Goutte\Client;

// make a real request to an external site
$client = new Client();
$crawler = $client->request('GET', 'https://github.com/login');

// select the form and fill in some values
$form = $crawler->selectButton('Log in')->form();
$form['login'] = 'symfonyfan';
$form['password'] = 'anypass';

// submit that form
$crawler = $client->submit($form);

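The problem is that neither of them works; both give the error posted above. I CAN, however, log in using the code from a previous question I asked: cURL Scrape then Parse/Find Specific Content

I just want to log in with Symfony/Goutte so that scraping the data I need is easier. Any help or suggestions? Thanks!

1 answer:

Answer 0: (score: 4)

Adding the following cURL configuration to the code fixes the error: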

    // make a real request to an external site
    $client = new Client();
    // turn off cURL's SSL host and peer verification on the underlying Guzzle client
    $client->getClient()->setDefaultOption('config/curl/'.CURLOPT_SSL_VERIFYHOST, FALSE);
    $client->getClient()->setDefaultOption('config/curl/'.CURLOPT_SSL_VERIFYPEER, FALSE);
    $crawler = $client->request('GET', 'https://github.com/login'); 
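
Note that the two options above switch certificate verification off entirely. That is fine for a quick test, but it leaves the connection open to interception. A minimal sketch of an alternative keeps verification on and points Guzzle at a CA bundle instead. It assumes the same Guzzle 4/5-era client as above (where setDefaultOption() is available) and uses a hypothetical bundle path that you would replace with a real file:

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;

// Hypothetical path: point this at a real CA bundle on your machine,
// for example the cacert.pem file published by the curl project.
$caBundle = '/path/to/cacert.pem';

$client = new Client();

// Assumption: Guzzle 4/5, where setDefaultOption() exists. The 'verify'
// request option accepts a path to a CA bundle, so certificate checks
// stay enabled instead of being switched off.
$client->getClient()->setDefaultOption('verify', $caBundle);

// Requests made afterwards, e.g.
// $client->request('GET', 'https://github.com/login'),
// are then verified against that bundle.
```

If the bundle path is wrong, the failure only shows up when the first request is made, so it is worth testing with a single GET before wiring in the login form.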

But then another error appeared:

The current node list is empty.
500 Internal Server Error - InvalidArgumentException 

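Again, I'm using Goutte and Symfony with the default code for a test task, namely logging in to GitHub over https.

The cause of the "node list is empty" error above is that the button on the GitHub login page is actually labeled "Sign in", not "Log in". Unfortunately, the Goutte API doesn't make it clear whether $form = $crawler->selectButton('Sign in')->form(); refers to the HTML name attribute or to the button's visible text. It is apparently the visible text, which is a bit confusing. So, after more research into a poorly documented API, I ended up with the following working code: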

// make a real request to an external site
$client = new Client();
$client->getClient()->setDefaultOption('config/curl/'.CURLOPT_SSL_VERIFYHOST, FALSE);
$client->getClient()->setDefaultOption('config/curl/'.CURLOPT_SSL_VERIFYPEER, FALSE);
$crawler = $client->request('GET', 'https://github.com/login');

// select the form and fill in some values
$form = $crawler->selectButton('Sign in')->form();
$form['login'] = 'symfonyfan';
$form['password'] = 'anypass';

// submit that form
$crawler = $client->submit($form);
echo $crawler->html();
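
To see the button matching in isolation, here is a small sketch that runs Symfony's DomCrawler against inline HTML, with no network access. The markup is a hypothetical, simplified stand-in for a login page, not GitHub's real HTML, and example.com is a placeholder URI. It shows that a label with no matching button yields an empty selection, which is what makes form() throw "The current node list is empty":

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical, simplified stand-in for a login page; not GitHub's real markup.
$html = <<<'HTML'
<html><body>
<form action="/session" method="post">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
</body></html>
HTML;

// An absolute URI is needed so the form's relative action can be resolved.
$crawler = new Crawler($html, 'https://example.com/login');

$wrong = $crawler->selectButton('Log in');  // no button carries this label
$right = $crawler->selectButton('Sign in'); // matches the submit input's value

// Calling form() on an empty selection raises the InvalidArgumentException,
// so check the count before going further.
if (count($wrong) === 0) {
    echo "No button labelled 'Log in' on this page\n";
}

$form = $right->form();
$form['login'] = 'symfonyfan';
$form['password'] = 'anypass';

// Show the fields that would be sent on submit.
print_r($form->getValues());
```

Checking count() on the selection before calling form() turns a 500 error into something you can handle, which helps when a site changes its button label.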