Connecting to a page (error 403)

Time: 2018-08-06 17:14:41

Tags: python

I can't connect to a page. Here is my code and the error I get:

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import urllib

someurl = "https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET"
req = Request(someurl)

try:
    response = urllib.request.urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("Everything is fine")
  

Error code: 403

2 answers:

Answer 0 (score: 1)

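Some websites require a browser-like "User-Agent" header, while others require specific cookies. In this case, I found by trial and error that both are required. What you need to do is:

  1. Send an initial request with a browser-like user agent. This will fail with a 403, but the response will also carry a valid cookie.
  2. Send a second request with the same user agent and the cookie you received in the first response.

In code: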

import urllib.request
from urllib.error import URLError

# This handler will store and send cookies for us.
handler = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(handler)
# Browser-like user agent to make the website happy.
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'
request = urllib.request.Request(url, headers=headers)

for i in range(2):
    try:
        response = opener.open(request)
    except URLError as exc:
        print(exc)

print(response)

# Output:
# HTTP Error 403: Forbidden  (expected, first request always fails)
# <http.client.HTTPResponse object at 0x...>  (correct 200 response)
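The two-step handshake can also be tried offline. The following is a minimal sketch, not the real site's logic: `CookieGateHandler` is a hypothetical local stand-in server that rejects a request with 403 and a `Set-Cookie` header until the client sends both a browser-like user agent and that cookie, and `fetch_with_cookie_retry` and `run_demo` are illustrative helper names. The cookie name and value (`token=abc`) are assumptions made up for the demo.

```python
import http.server
import threading
import urllib.request
from urllib.error import HTTPError


class CookieGateHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in server: 403 until the client returns its cookie."""

    def do_GET(self):
        has_agent = "Mozilla" in self.headers.get("User-Agent", "")
        has_cookie = "token=abc" in self.headers.get("Cookie", "")
        if has_agent and has_cookie:
            body = b"ok"
            self.send_response(200)
        else:
            body = b"forbidden"
            self.send_response(403)
            # The cookie is handed out together with the rejection.
            self.send_header("Set-Cookie", "token=abc; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_with_cookie_retry(url, opener, attempts=2):
    """Open url up to `attempts` times; the opener carries cookies between tries."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    statuses = []
    for _ in range(attempts):
        try:
            with opener.open(request) as response:
                statuses.append(response.status)
                return statuses, response.read()
        except HTTPError as exc:
            # The cookie processor has already stored any Set-Cookie header.
            statuses.append(exc.code)
            exc.close()
    return statuses, None


def run_demo():
    server = http.server.HTTPServer(("127.0.0.1", 0), CookieGateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # bypass any system proxy for localhost
        urllib.request.HTTPCookieProcessor(),
    )
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        return fetch_with_cookie_retry(url, opener)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Against this stand-in server the first attempt is rejected and the second succeeds, which mirrors the behaviour described above. Whether a real site behaves the same way has to be checked case by case.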

Alternatively, if you prefer, you can use requests:

import requests

session = requests.Session()
jar = requests.cookies.RequestsCookieJar()
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene=MET'

for i in range(2):
    response = session.get(url, cookies=jar, headers=headers)
    print(response)

# Output:
# <Response [403]>
# <Response [200]>

Answer 1 (score: 0)

You can use http.client. First, you need to open a connection to the server. Then, make a GET request. Like this:

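import http.client

conn = http.client.HTTPConnection("genecards.org:80")

try:
    conn.request("GET", "/cgi-bin/carddisp.pl?gene=MET")
    response = conn.getresponse()
    body = response.read().decode("UTF-8")
except (http.client.HTTPException, OSError) as e:
    # http.client raises its own exceptions (or OSError for network
    # failures), not urllib's HTTPError/URLError.
    print('We failed to reach a server.')
    print('Reason: ', e)
else:
    # http.client does not raise on 4xx/5xx, so check the status explicitly.
    if response.status >= 400:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', response.status)
    else:
        print("Everything is fine")
finally:
    conn.close()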
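One point worth keeping in mind with http.client is that a 403 arrives as an ordinary response object instead of an exception, so the status has to be inspected by hand. The sketch below illustrates this under stated assumptions: `AlwaysForbiddenHandler` is a hypothetical local stand-in server that answers every GET with 403, and `get_status` and `run_demo` are illustrative helper names, not part of any library.

```python
import http.client
import http.server
import threading


class AlwaysForbiddenHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in server that answers every GET with 403."""

    def do_GET(self):
        body = b"forbidden"
        self.send_response(403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def get_status(host, port, path, headers=None):
    """Send one GET and return (status, reason); 4xx/5xx do not raise."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", path, headers=headers or {})
        response = conn.getresponse()
        response.read()  # drain the body before closing the connection
        return response.status, response.reason
    finally:
        conn.close()


def run_demo():
    server = http.server.HTTPServer(("127.0.0.1", 0), AlwaysForbiddenHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return get_status("127.0.0.1", server.server_port, "/",
                          headers={"User-Agent": "Mozilla/5.0"})
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Passing a `headers` dictionary to `conn.request` is also how a browser-like user agent or a cookie would be sent with this module.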