Scraping an ASPX website that uses cookies and requires login, using Python

Date: 2014-06-26 07:54:13

Tags: python http pdf cookies python-requests

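I am trying to scrape some PDFs from snl.com. I have a paid subscription and valid login credentials.

The URL of one of the PDF files is: http://www.snl.com/interactivex/file.aspx?Id=10735427&KeyFileFormat=PDF

After I log in manually and visit the URL above, the PDF renders in the browser and the actual URL in the address bar is: http://ofccolo.snl.com/Cache/44D87724CE10735427.PDF?CachePath=%5c%5cdmzdoc2%5cwebcache%24%5c&T=&O=PDF&Y=&D=

When I access that URL, I am redirected to https://www.snl.com/interactivex/default.aspx, which is a login page.

I have read several threads on SO about Python requests and tried the following code to get past the login page and handle the cookies, but I still get the login page as the response, with the message: "If you are already a registered SNL user, log in using your email address and password."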

import requests, sys
from requests.packages.urllib3 import add_stderr_logger

add_stderr_logger()
s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0'

name_form = 'username'
password_form = 'Password'
login = {name_form: 'my_email_id', password_form: 'my_password'}
login_response = s.post("https://www.snl.com/interactivex/default.aspx", data=login)
print 'l',login_response
for r in login_response.history:
    if r.status_code == 401:  # 401 means authentication failed
        sys.exit(1)  # abort

pdf_response = s.get("http://www.snl.com/interactivex/file.aspx?Id=17670354&KeyFileFormat=PDF")

Output:

2014-06-26 13:04:54,555 DEBUG Added an stderr logging handler to logger: requests.packages.urllib3
2014-06-26 13:04:54,605 INFO Starting new HTTPS connection (1): www.snl.com
2014-06-26 13:04:55,943 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 302 152
2014-06-26 13:04:56,282 DEBUG "GET /interactivex/LoginCookieCheck.aspx HTTP/1.1" 302 143
2014-06-26 13:04:56,650 DEBUG "GET /interactivex/default.aspx HTTP/1.1" 200 None
2014-06-26 13:04:56,865 INFO Starting new HTTP connection (1): www.snl.com
2014-06-26 13:04:57,447 DEBUG "GET /interactivex/file.aspx?Id=17670354&KeyFileFormat=PDF HTTP/1.1" 302 143
2014-06-26 13:04:57,788 DEBUG "GET /InteractiveX/default.aspx HTTP/1.1" 302 162
2014-06-26 13:04:58,151 DEBUG "GET /InteractiveX/default.aspx HTTP/1.1" 200 None

I don't know how to interpret this output, but when I Googled response code 200 I learned that it means OK.
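One way to read the log above is to pull out each request line and look at where the redirect chain ends. The following is a minimal Python 3 sketch, not part of the original post. The helper names `parse_requests` and `ended_on_login` are hypothetical, and treating `/interactivex/default.aspx` as the login page is an assumption based on the redirect described earlier.

```python
import re

# Matches the request lines urllib3 writes at DEBUG level, e.g.
# '... DEBUG "GET /interactivex/default.aspx HTTP/1.1" 302 152'
REQUEST_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def parse_requests(log_text):
    """Return (method, path, status) for every request line in the log."""
    return [
        (m.group("method"), m.group("path"), int(m.group("status")))
        for m in REQUEST_LINE.finditer(log_text)
    ]


def ended_on_login(hops, login_path="/interactivex/default.aspx"):
    """True if the last request got a 200 for the (assumed) login page."""
    if not hops:
        return False
    _, path, status = hops[-1]
    # Compare case-insensitively and ignore any query string.
    return status == 200 and path.lower().split("?")[0] == login_path


# The last three lines of the log, i.e. the PDF request and its redirects.
log = '''
2014-06-26 13:04:57,447 DEBUG "GET /interactivex/file.aspx?Id=17670354&KeyFileFormat=PDF HTTP/1.1" 302 143
2014-06-26 13:04:57,788 DEBUG "GET /InteractiveX/default.aspx HTTP/1.1" 302 162
2014-06-26 13:04:58,151 DEBUG "GET /InteractiveX/default.aspx HTTP/1.1" 200 None
'''

hops = parse_requests(log)
for method, path, status in hops:
    print(status, method, path)
print("ended on login page:", ended_on_login(hops))
```

Read this way, the request for file.aspx is answered with a 302, and the chain finishes with a 200 for default.aspx. That suggests the 200 belongs to the login page rather than to the PDF. With requests itself, the same information should be available from `pdf_response.history` and `pdf_response.url`.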

But when I print pdf_response.text, it returns the login page again.
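A possible explanation, which the post does not confirm, is that the login form expects more than the two fields being posted. ASP.NET Web Forms pages typically include hidden inputs such as `__VIEWSTATE` and `__EVENTVALIDATION` that need to be sent back with the form. The sketch below (Python 3, standard library only) collects the hidden inputs from a page and merges them with the credentials. The sample HTML, the hidden field values, and the helper names are all made up for illustration; the real field names on snl.com would have to be read from the actual login page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields


def build_login_payload(html, username, password,
                        name_form="username", password_form="Password"):
    """Merge the page's hidden fields with the login credentials."""
    payload = hidden_fields(html)
    payload[name_form] = username
    payload[password_form] = password
    return payload


# Hypothetical login page; the real markup on the site may differ.
sample_page = '''
<form method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="username" />
  <input type="password" name="Password" />
</form>
'''

payload = build_login_payload(sample_page, "my_email_id", "my_password")
print(payload)
```

With a `requests.Session` named `s`, this could be used roughly as `page = s.get(login_url)` followed by `s.post(login_url, data=build_login_payload(page.text, 'my_email_id', 'my_password'))`, where `login_url` stands for the login page address. Whether that is sufficient for this site is untested.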

0 answers:
