HTTP headers - requests - Python

Time: 2018-08-01 10:05:12

Tags: python http web-scraping python-requests

I am trying to scrape a website whose request headers include some attributes that are new to me, such as :authority, :method, :path, and :scheme.

{':authority':'xxxx',':method':'GET',':path':'/xxxx',':scheme':'https','accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','accept-encoding':'gzip, deflate, br','accept-language':'en-US,en;q=0.9','cache-control':'max-age=0','cookie':'GOOGLE_ABUSE_EXEMPTION=ID=0d5af55f1ada3f1e:TM=1533116294:C=r:IP=182.71.238.62-:S=APGng0u2o9IqL5wljH2o67S5Hp3hNcYIpw;1P_JAR=2018-8-1-9',   'upgrade-insecure-requests': '1',   'user-agent': 'Mozilla/5.0(WindowsNT6.1;Win64;x64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/68.0.3440.84Safari/537.36',   'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE=' }

I tried passing them as headers in an HTTP request, but ended up with the error shown below.

  

ValueError: Invalid header name b':scheme'

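Any help in understanding these attributes, and guidance on how to use them when making the request, would be appreciated.

Edit: added code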

import requests

url = 'https://www.google.co.in/search?q=some+text'

headers = {':authority':'xxxx',':method':'GET',':path':'/xxxx',':scheme':'https','accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8','accept-encoding':'gzip, deflate, br','accept-language':'en-US,en;q=0.9','cache-control':'max-age=0','upgrade-insecure-requests': '1',   'user-agent': 'Mozilla/5.0(WindowsNT6.1;Win64;x64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/68.0.3440.84Safari/537.36',   'x-client-data': 'CJG2yQEIpbbJAQjEtskBCKmdygEI2J3KAQioo8oBCIKkygE=' }

response = requests.get(url, headers=headers)

print(response.text)

2 answers:

Answer 0 (score: 2)

Your error comes from here (Python's source code).

As the RFC states, an HTTP header name cannot start with a colon.
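To illustrate the rule, here is a minimal sketch that checks header names against the "token" grammar from RFC 7230. The helper name `is_valid_header_name` is hypothetical, and this is not the exact check Python's standard library performs (its own check is written differently); it only shows why a name such as `:scheme` is rejected.

```python
import re

# RFC 7230 defines a header field name as a "token":
# one or more characters from this set. The colon is not in the set.
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def is_valid_header_name(name):
    """Return True if `name` is a valid RFC 7230 field name (hypothetical helper)."""
    return TOKEN.fullmatch(name) is not None


for name in ["accept", "user-agent", ":scheme", ":authority"]:
    print(name, is_valid_header_name(name))
```

Names beginning with a colon fail the check, which is consistent with the ValueError raised for `:scheme` in the question.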

Answer 1 (score: 1)

:authority, :method, :path, and :scheme are not regular HTTP headers; they are HTTP/2 pseudo-header fields.

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields

':method':'GET'

defines the HTTP request method

https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods

and

':authority':'xxxx',':path':'/xxxx',':scheme':'https'

are parts of the URI https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax
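The mapping described in this answer can be sketched as follows: the names starting with a colon are separated from the ordinary headers, and the URL is rebuilt from the scheme, authority, and path. The helper names `split_pseudo_headers` and `url_from_pseudo`, and the host `www.example.com`, are assumptions made for illustration and do not come from the question.

```python
from urllib.parse import urlunsplit


def split_pseudo_headers(captured):
    """Separate names starting with ':' from ordinary headers (hypothetical helper)."""
    pseudo = {k: v for k, v in captured.items() if k.startswith(":")}
    regular = {k: v for k, v in captured.items() if not k.startswith(":")}
    return pseudo, regular


def url_from_pseudo(pseudo):
    """Rebuild the request URL from :scheme, :authority and :path (hypothetical helper)."""
    # :path carries both the path and the optional query string.
    path, _, query = pseudo[":path"].partition("?")
    return urlunsplit((pseudo[":scheme"], pseudo[":authority"], path, query, ""))


# Headers as copied from the browser; the host is a placeholder.
captured = {
    ":authority": "www.example.com",
    ":method": "GET",
    ":path": "/search?q=some+text",
    ":scheme": "https",
    "accept-language": "en-US,en;q=0.9",
}

pseudo, headers = split_pseudo_headers(captured)
url = url_from_pseudo(pseudo)
print(pseudo[":method"], url)
print(headers)
```

With the pieces separated this way, the request could be sent with `requests.request(pseudo[':method'], url, headers=headers)`, so that only the ordinary headers are passed in the `headers` argument.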