Question

URL 1: https://duapp3.drexel.edu/webtms_du/

URL 2: https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX

URL 3: https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX

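As a personal programming project, I want to scrape my university's course catalog and serve it as a RESTful API.

However, I have run into the following problem.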

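The page I need to scrape is URL 3. However, URL 3 only returns meaningful information after I have visited URL 2 (which sets the term via Colleges.asp?Term=201125), and URL 2 can only be accessed after visiting URL 1.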

I monitored the HTTP traffic with Fiddler, and I don't think they use cookies. Closing the browser immediately resets everything, so I suspect they are using a session.

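How can I scrape URL 3? I tried programmatically visiting URLs 1 and 2 first and then calling file_get_contents(url3), but that doesn't work (probably because the three requests are registered as three different sessions).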

Answer 1

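A session also needs some mechanism to identify you. Common methods include cookies and a session ID in the URL.

Running curl -v on URL 1 shows that a session cookie is indeed being set: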

Set-Cookie: ASPSESSIONIDASBRRCCS=LKLLPGGDFBGGNFJBKKHMPCDA; path=/

You need to send this cookie back to the server on every subsequent request to keep the session alive.

If you want to use file_get_contents, you need to create a context for it manually with stream_context_create so that the cookie is included in the request.
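A rough sketch of that file_get_contents approach, assuming the session cookie arrives in a Set-Cookie header on the first response. The helper names cookieHeaderFrom and fetchWithCookies are made up for illustration and are not part of PHP; the sketch also assumes the https wrapper is available (openssl extension) and ignores any cookies set by later responses.

```php
<?php
// Hypothetical helper: turn raw response header lines into a value
// suitable for a "Cookie:" request header, e.g. "a=b; c=d".
function cookieHeaderFrom(array $responseHeaders): string
{
    $pairs = array();
    foreach ($responseHeaders as $line) {
        // Header names are case-insensitive, so match with stripos
        if (stripos($line, 'Set-Cookie:') === 0) {
            $value = trim(substr($line, strlen('Set-Cookie:')));
            // Keep only name=value; drop attributes such as "path=/"
            $pairs[] = trim(explode(';', $value, 2)[0]);
        }
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: GET a URL, sending the given cookies if any.
// Returns array(body, response header lines).
function fetchWithCookies(string $url, string $cookies = ''): array
{
    $http = array('method' => 'GET');
    if ($cookies !== '') {
        $http['header'] = "Cookie: $cookies\r\n";
    }
    $context = stream_context_create(array('http' => $http));
    $body = file_get_contents($url, false, $context);
    // PHP's HTTP wrapper fills $http_response_header in the local scope
    return array($body, $http_response_header ?? array());
}

// Usage (makes real network requests, so it is left commented out):
// list(, $headers) = fetchWithCookies('https://duapp3.drexel.edu/webtms_du/');
// $cookies = cookieHeaderFrom($headers);
// fetchWithCookies('https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX', $cookies);
// list($html) = fetchWithCookies('https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX', $cookies);
```

The point of the design is that the cookie from the first response is carried by hand into every later request, which is exactly the bookkeeping a cookie-aware HTTP client would otherwise do for you.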

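Another option (which I personally prefer) is to use the curl functions that PHP provides. (They can even handle the cookie traffic for you!) But that's just my preference.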

Edit:

Here is a working example that scrapes the path from your question.

$scrape = array(
    "https://duapp3.drexel.edu/webtms_du/",
    "https://duapp3.drexel.edu/webtms_du/Colleges.asp?Term=201125&univ=DREX",
    "https://duapp3.drexel.edu/webtms_du/Courses.asp?SubjCode=CS&CollCode=E&univ=DREX"
);

$data = '';
$ch = curl_init();

// Point the cookie jar at a temporary file. We never read the file ourselves,
// but setting CURLOPT_COOKIEJAR turns on curl's cookie handling, so cookies
// received on one request are sent along with the following requests
curl_setopt($ch, CURLOPT_COOKIEJAR, tempnam(sys_get_temp_dir(), 'curl'));

// We don't want direct output by curl
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Then run along the scrape path
foreach ($scrape as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
}

curl_close($ch);

echo $data;
