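I've been trying to write a script that fetches my grades from our online intranet page.
The data I want to retrieve is at https://sb.stads.ku.dk/SB_PSTA/sb/resultater/studresultater.jsp
I tried doing this in Python, but I can't work out how to reach this page from a script once I've logged in. Simply requesting the page isn't enough; it looks like I get redirected after logging in.
This is what I have so far: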
    import urllib2

    theurl = 'https://intranet.ku.dk/Selvbetjening/Sider/default.aspx'
    username = 'MYUSERNAME'
    password = 'MYPASSWORD'

    passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, theurl, username, password)
    authhandler = urllib2.HTTPBasicAuthHandler(passman)
    opener = urllib2.build_opener(authhandler)
    urllib2.install_opener(opener)

    pagehandle = urllib2.urlopen(theurl)
    for elm in pagehandle:
        print elm
Thanks!
Answer 0 (score: 0)
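Whenever the response status is 301 or 302 (which means a redirect), the redirect URL is given in the 'Location' response header. You then need to make another request to that URL. Keep in mind that this URL expects the user to be logged in, so you also need to send all the cookies with it.
What you are actually doing is scraping the site to retrieve data from it. That means you need to submit the login data, save the cookies from each response, and follow the redirect with those cookies.
Here is a code snippet that uses httplib: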
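The redirect-and-cookies flow described above can also be left to the standard library. The following is a minimal sketch, assuming Python 3 (where urllib2 became urllib.request and cookielib became http.cookiejar). The helper names build_session_opener and login_and_fetch are hypothetical, and the form field names must be read from the site's real login page; nothing here is specific to intranet.ku.dk.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_session_opener():
    """Return (opener, jar): an opener that stores cookies and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_and_fetch(opener, login_url, form_fields, target_url):
    """POST the login form, then GET the protected page with the same cookies.

    form_fields is a dict of the login form's field names and values
    (hypothetical here; inspect the real login page to find them).
    """
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    # The opener follows 301/302 responses itself; cookies set by the
    # redirecting response are stored in the jar and sent on the next request.
    with opener.open(login_url, data=body) as login_response:
        login_response.read()
    with opener.open(target_url) as page:
        return page.read().decode("utf-8", errors="replace")
```

Usage would look like `opener, jar = build_session_opener()` followed by `login_and_fetch(opener, login_url, {...}, grades_url)`. Whether a plain form POST is enough depends on how the site's login actually works, which the thread does not show.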
    import httplib
    import urllib


    class scraper():
        def __init__(self):
            self.cookies = {}  # cookie name -> value, resent on every request
            # You can fill these values by looking at what the browser sends.
            self.headers = {
                'Accept': 'text/html; */*',
                'Accept-Language': '',
                'Accept-Encoding': 'identity',
                'Connection': 'keep-alive',
                'Content-Type': 'application/x-www-form-urlencoded'}

        def somefunc(self, postDataDict):
            # postDataDict holds the login form fields; their names depend on the site.
            host = "intranet.ku.dk"
            url = "https://intranet.ku.dk/Selvbetjening/Sider/default.aspx"
            data = urllib.urlencode(postDataDict)
            response = self.makeRequest(host, url, data)
            if response.status in (301, 302):
                # Assumes an absolute Location URL; keeps only its first path segment.
                url = '/' + response.getheader("Location").split('/')[3]
                response = self.makeRequest(host, url, '')
            return response

        def makeRequest(self, host, url, data):
            # Send back every cookie collected so far.
            cookies = ''
            for key in self.cookies:
                cookies = cookies + key + '=' + self.cookies[key] + '; '
            self.headers['Cookie'] = cookies
            conn = httplib.HTTPSConnection(host)
            conn.request("POST", url, data, self.headers)
            response = conn.getresponse()
            self.saveCookies(response.getheader("Set-Cookie"))
            self.lastBody = response.read()  # read the body before closing
            conn.close()
            # Referer for the next request; url may be absolute or just a path.
            fullUrl = url if url.startswith('http') else 'https://' + host + url
            self.headers['Referer'] = fullUrl
            return response

        def saveCookies(self, cookies):
            # Crude parser: keep name=value pairs and skip cookie attributes.
            if cookies is not None:
                skipped = ('expires', 'Max-Age', 'Path', 'path', 'Domain')
                for value in cookies.split():
                    parts = value.split('=', 1)
                    if len(parts) > 1 and parts[0] not in skipped:
                        self.cookies[parts[0]] = parts[1].rstrip(';')
PS: I adapted this from my own, more specific code to make it generic, so please check it for errors.
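Two parts of the snippet are fragile: saveCookies splits the Set-Cookie header on whitespace, and the redirect handling keeps only the first path segment of the Location header. Below is a hedged sketch of sturdier replacements, assuming Python 3's standard library. The names parse_set_cookie and redirect_target are hypothetical helpers and are not part of the answer's code.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit


def parse_set_cookie(header_value):
    """Return {name: value} for one Set-Cookie header value.

    Attributes such as Path, Domain or HttpOnly are parsed but dropped.
    """
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


def redirect_target(location):
    """Return the path (plus query string) to request from a Location header.

    Works for absolute URLs and for paths; an empty path becomes "/".
    """
    parts = urlsplit(location)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")
```

When a response carries several Set-Cookie headers, it is safer to parse each one separately (in Python 3, `response.headers.get_all("Set-Cookie")` returns them as a list) than to parse a single comma-joined string.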