Unable to log in to an HTTPS site (https://malwr.com) via a Python script

Time: 2014-12-30 06:21:27

Tags: python python-2.7 https web-scraping

I need to log in to the malwr site via a Python script. I have tried various modules, such as the mechanize module and the requests module, but I have had no success logging in to the site with a script.

I want to create an automated script that downloads files from this malware analysis site by parsing its HTML pages, but because of the login problem I cannot parse the href attributes on the page to get the download links.

Here is my code:

import urllib, urllib2, cookielib

username = 'myuser'
password = 'mypassword'

# Cookie-aware opener so the session cookie persists between requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
# POST the credentials, then try to fetch an analysis page
opener.open('https://malwr.com/account/login/', login_data)
resp = opener.open('https://malwr.com/analysis/MDMxMmY0NjMzNjYyNDIyNDkzZTllOGVkOTc5ZTQ5NWU/')
print resp.read()

What am I doing wrong?

1 answer:

Answer 0 (score: 2)

The key thing to do is to parse the csrf token out of the login form and pass it, along with `username` and `password`, as POST parameters to the https://malwr.com/account/login/ endpoint.
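To illustrate just the token-extraction step, here is a minimal standard-library sketch (Python 3 spelling; in the question's Python 2 the import is `from HTMLParser import HTMLParser`). The HTML below is a hypothetical stand-in for the Django login form, not the site's actual markup, but Django embeds the token in a hidden input named `csrfmiddlewaretoken` just like this:

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# Hypothetical stand-in for the form served at /account/login/;
# the real page's surrounding markup may differ.
LOGIN_PAGE = """
<form method="post" action="/account/login/">
  <input type="hidden" name="csrfmiddlewaretoken" value="abc123token">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

class CsrfExtractor(HTMLParser):
    """Collects the value of the csrfmiddlewaretoken hidden input."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == 'csrfmiddlewaretoken':
            self.token = attrs.get('value')

parser = CsrfExtractor()
parser.feed(LOGIN_PAGE)
print(parser.token)  # abc123token
```

In practice you would feed the parser the body of a GET to the login page; the answer below does the same thing more conveniently with BeautifulSoup.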

Here is a solution using the requests and BeautifulSoup libraries.

First, it opens a session to maintain cookies, so that you stay "logged in" throughout the scraping session, and it gets the csrf token from the login page. The next step is to send a POST request to log in. Then you can open the "analysis" page and retrieve the link:

from urlparse import urljoin
from bs4 import BeautifulSoup
import requests

base_url = 'https://malwr.com/'
url = 'https://malwr.com/account/login/'
username = 'username'
password = 'password'

session = requests.Session()

# getting csrf value
response = session.get(url)
soup = BeautifulSoup(response.content)

form = soup.form
csrf = form.find('input', attrs={'name': 'csrfmiddlewaretoken'}).get('value')

# logging in
data = {
    'username': username,
    'password': password,
    'csrfmiddlewaretoken': csrf
}
session.post(url, data=data)

# getting analysis data
response = session.get('https://malwr.com/analysis/MDMxMmY0NjMzNjYyNDIyNDkzZTllOGVkOTc5ZTQ5NWU/')
soup = BeautifulSoup(response.content)

link = soup.find('section', id='file').find('table')('tr')[-1].a.get('href')
link = urljoin(base_url, link)
print link

Prints:

https://malwr.com/analysis/file/MDMxMmY0NjMzNjYyNDIyNDkzZTllOGVkOTc5ZTQ5NWU/sample/7fe8157c0aa251b37713cf2dc0213a3ca99551e41fb9741598eb75c294d1537c/
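The `urljoin` call is what turns the relative `href` scraped from the table into the absolute URL above; a quick standard-library illustration (Python 3 spelling shown; the answer's Python 2 import is `from urlparse import urljoin`, and the path here is illustrative rather than a real sample link):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base_url = 'https://malwr.com/'
# A root-relative href like the one found in the analysis table
href = '/analysis/file/abc/sample/def/'
link = urljoin(base_url, href)
print(link)  # https://malwr.com/analysis/file/abc/sample/def/
```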