Parsing a request response for key-value pairs

Time: 2015-07-08 17:17:58

Tags: python python-2.7 python-requests

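I am storing the response from a POST request to Instagram's API in a text file. The response is HTML, and it contains the access token I want to extract. It is HTML because this POST response is normally handled by an end user, who clicks a button and is then given an access code. I need to do this on the back end, though, so I have to process the HTML response myself.

Anyway, here is my code so far (obviously the real client ID has been masked for this post):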

import requests

OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
# POST to the authorize endpoint and keep the decoded HTML body
OAuth_AccessRequest = requests.post(OAuthURL).text
#print OAuth_AccessRequest

# Save the HTML to disk as UTF-8 bytes
with open('response.txt', 'w') as OAuthResponse:
    OAuthResponse.write(OAuth_AccessRequest.encode("UTF-8"))

# Read it back; the with block closes the file handle
with open('response.txt', 'r') as OAuthReady:
    OAuthView = OAuthReady.read()
print OAuthView

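What I am left with after this is the HTML stored in the text file. Inside that HTML, however, is a dictionary whose key-value pairs I need to access. Part of it looks like this: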

</div> <!-- .root -->

    <script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-shim.min.js></script>
<script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-sham.min.js></script>
<script type="text/javascript">window._sharedData = {"static_root":"\/\/instagramstatic-a.akamaihd.net\/bluebar\/422f3d9","entry_data":{},"hostname":"instagram.com","platform":{"is_touch":false,"app_platform":"web"},"qe":{"su":false},"display_properties_server_guess":{"viewport_width":360,"pixel_ratio":1.5},"country_code":"US","language_code":"en","gatekeepers":{"tr":false},"config":{"dismiss_app_install_banner_until":null,"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"},"environment_switcher_visible_server_guess":true};</script>

    </body>
</html>

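What I need to grab is the alphanumeric string that is the value of the key "csrf_token". What is the best way to extract it from the HTML stored in the txt file?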

1 answer:

Answer 0: (score: 2)

If the csrf_token string is the only one of its kind on the whole page, then extracting it with a regular expression is trivial:

import re

token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')

token = token_pattern.search(requests.post(OAuthURL).content).group(1)

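Note that I used the response's content attribute: there is no point in decoding the entire response to Unicode just to pull out a few ASCII characters.

Demo: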

>>> import requests, re
>>> token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')
>>> OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
>>> token_pattern.search(requests.post(OAuthURL).content).group(1)
'3fd6022ac344c3eaea46e87e258ef9c6'
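
The same pattern can also be applied to the HTML that the question saved in response.txt. The following is a minimal sketch rather than part of the original answer: the helper name extract_csrf_token and the sample_html string are made up for illustration. It returns None when the token is missing instead of letting .group(1) fail on a None match, and it keeps a bytes version of the pattern because on Python 3 r.content is bytes and a str pattern cannot search it.

```python
import re

# Same expression as in the answer, compiled once for text and once for bytes.
TOKEN_TEXT = re.compile(r'"csrf_token":\s*"([^"]+)"')
TOKEN_BYTES = re.compile(br'"csrf_token":\s*"([^"]+)"')


def extract_csrf_token(html):
    """Return the csrf_token value found in html, or None if it is absent.

    Hypothetical helper: accepts decoded text (e.g. response.text) or
    raw bytes (e.g. response.content) and returns the same type.
    """
    pattern = TOKEN_BYTES if isinstance(html, bytes) else TOKEN_TEXT
    match = pattern.search(html)
    if match is None:
        return None
    return match.group(1)


# Made-up stand-in for the contents of response.txt.
sample_html = (
    '<script type="text/javascript">window._sharedData = '
    '{"config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"}};</script>'
)

print(extract_csrf_token(sample_html))
```

To run it against the saved file, read response.txt and pass its contents to extract_csrf_token.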

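You may also want to look at the response headers and cookies; a CSRF token is often set as a cookie as well (or at least stored as a value in the session).

For this specific request, for example, the token is also stored as a cookie, and it matches the value in the JavaScript block: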

>>> r = requests.post(OAuthURL)
>>> r.cookies
<RequestsCookieJar[Cookie(version=0, name='csrftoken', value='b2b621c198642e26a19fc9bf1b38d246', port=None, port_specified=False, domain='instagram.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1467828030, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
>>> r.cookies['csrftoken']
'b2b621c198642e26a19fc9bf1b38d246'
>>> 'b2b621c198642e26a19fc9bf1b38d246' in r.content
True
>>> token_pattern.search(r.content).group(1)
'b2b621c198642e26a19fc9bf1b38d246'
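
If more than the token is needed from the window._sharedData dictionary, another option is to cut the object literal out of the script tag and hand it to the json module. This is only a sketch under assumptions: parse_shared_data is a hypothetical helper, the sample is an abridged copy of the script block from the question, and the pattern assumes the assignment ends with }; directly before </script>, which may not hold if the page markup changes.

```python
import json
import re

# Captures the object literal assigned to window._sharedData.
# Assumption: the assignment ends with "};" right before "</script>".
SHARED_DATA = re.compile(
    r'window\._sharedData\s*=\s*(\{.*?\});\s*</script>', re.DOTALL
)


def parse_shared_data(html):
    """Return window._sharedData as a dict, or None if it is not found."""
    match = SHARED_DATA.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Abridged copy of the script block shown in the question.
sample_html = (
    r'<script type="text/javascript">window._sharedData = '
    r'{"static_root":"\/\/instagramstatic-a.akamaihd.net\/bluebar\/422f3d9",'
    r'"entry_data":{},"hostname":"instagram.com",'
    r'"config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"},'
    r'"country_code":"US"};</script>'
)

shared = parse_shared_data(sample_html)
print(shared["config"]["csrf_token"])
print(shared["hostname"])
```

The regular expression from the answer remains the simpler choice when only the token is needed; parsing the JSON is mainly useful when several keys have to be read as pairs.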