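How do I download a file from a website that requires login information using Python?

Asked: 2014-04-02 05:55:59

Tags: python html login web urllib2

I am trying to download some data from a website using Python. If you simply copy and paste the URL, it shows nothing unless you fill in the login information. I have the login name and password, but how do I include them in Python?

My current code is: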

import urllib, urllib2, cookielib

username = my_user_name
password = my_pwd

link = 'www.google.com' # just for instance
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})

opener.open(link, login_data)
resp = opener.open(link,login_data)
print resp.read()

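No error is raised, but resp.read() is a bunch of CSS, and the only message in it is "You must log in first to read news here."

So how can I retrieve the page after logging in?

I also noticed that the website requires 3 entries: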

Company: 

Username: 

Password:

I have all of them, but how do I put all three into the login variable?

If I run it without logging in:

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

opener.open(dd)
resp = opener.open(dd)

print resp.read()

Here is the printed output:

<DIV id=header>
<DIV id=strapline><!-- login_display -->
<P><FONT color=#000000>All third party users of this website and/or data produced by the Baltic do so at their own risk. The Baltic owes no duty of care or any other obligation to any party other than the contractual obligations which it owes to its direct contractual partners. </FONT></P><IMG src="images/top-strap.gif"> <!-- template [strapline]--></DIV><!-- end strapline -->
<DIV id=memberNav>
<FORM class=members id=form1 name=form1 action=client_login/client_authorise.asp?action=login method=post onsubmits="return check()">

2 answers:

Answer 0 (score: 0)

Use Scrapy to crawl that data.

Then you can do something like this:

import logging

from scrapy import Spider
from scrapy.http import FormRequest


class LoginSpider(Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # fill in the login form found on the page and submit it
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check that login succeeded before going on
        if b"authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
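
FormRequest.from_response reads the login form out of the downloaded page, so it helps to know which fields that form actually declares. Below is a minimal standard-library sketch for listing them. FormFinder is a hypothetical helper, and the input names in the sample are invented, because the output pasted in the question is cut off before the input tags.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect the action, method and input names of each form in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({
                "action": attrs.get("action"),
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            })
        elif tag == "input" and self.forms and attrs.get("name"):
            # Simplification: inputs go to the most recently opened form.
            self.forms[-1]["fields"].append(attrs["name"])


# Made-up sample: the form tag mirrors the question, the inputs are invented.
sample = """
<FORM class=members id=form1 name=form1
      action=client_login/client_authorise.asp?action=login method=post>
<INPUT name=companyName><INPUT name=username><INPUT type=password name=password>
</FORM>
"""

finder = FormFinder()
finder.feed(sample)
print(finder.forms[0])
```

Feeding it the real page instead of the sample should reveal the names to use as formdata keys, and the action attribute shows the URL the login has to be posted to.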

Answer 1 (score: 0)

This code should work with Python-Requests. Just replace the ... with the actual domain name and, of course, the login data.

from requests import Session

s = Session() # this session will hold the cookies

# here we first login and get our session cookie
s.post("http://.../client_login/client_authorise.asp?action=login", {"companyName":"some_company", "password":"some_password", "username":"some_user", "status":""})

# now we're logged in and can request any page
resp = s.get("http://.../").text

print(resp)
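
If Requests is not available, the same flow can probably be reproduced with the Python 3 standard library, which is the closest equivalent of the urllib2 code in the question. This is only a sketch: the field names companyName, username, password and status are copied from the call above and not verified against the site, and base_url, page_url, dest_path and both function names are hypothetical.

```python
import shutil
from http.cookiejar import CookieJar
from urllib.parse import urlencode, urljoin
from urllib.request import HTTPCookieProcessor, build_opener

# Path taken from the form's action attribute shown in the question.
LOGIN_PATH = "client_login/client_authorise.asp?action=login"


def build_login_request(base_url, company, username, password):
    """Return (url, body) for the login POST; the field names are assumptions."""
    url = urljoin(base_url, LOGIN_PATH)
    body = urlencode({
        "companyName": company,
        "username": username,
        "password": password,
        "status": "",
    }).encode("ascii")
    return url, body


def login_and_download(base_url, page_url, dest_path, company, username, password):
    """Log in, then save page_url to dest_path using the same cookie jar."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    url, body = build_login_request(base_url, company, username, password)
    # Passing a body makes this a POST; any session cookie lands in the jar.
    opener.open(url, body).close()
    with opener.open(page_url) as resp, open(dest_path, "wb") as out:
        # Stream the response to disk instead of reading it all into memory.
        shutil.copyfileobj(resp, out)
```

The important difference from the code in the question is that the login data goes to the form's action URL, with all three entries in one dictionary, and the page is requested afterwards through the same opener so the cookie is sent along.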