I am using the urllib.request package to open and read web pages. I want to make sure my code handles redirects gracefully. Right now, when I hit a redirect, I simply fail (it is an HTTPError). Can someone show me how to handle it? My code currently looks like:
import urllib.request
import urllib.error
from socket import timeout

try:
    # read() returns bytes; decode instead of wrapping in str(),
    # which would produce a "b'...'" literal
    text = urllib.request.urlopen(url, timeout=10).read().decode("utf-8")
except ValueError as error:
    print(error)  # e.g. a malformed URL
except urllib.error.HTTPError as error:
    print(error)  # server returned an error status code
except urllib.error.URLError as error:
    print(error)  # network-level failure (DNS, connection refused, ...)
except timeout as error:
    print(error)  # socket timed out
Please help me, I am a beginner. Thanks!
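For what it's worth, Python 3's urllib.request already follows 3xx redirects by default, and the response's geturl() reveals the final URL after any redirects. Here is a self-contained sketch that demonstrates this against a throwaway local server (the server, paths, and response body are purely illustrative):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectDemo(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            # Redirect /old to /new
            self.send_response(302)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", "5")
            self.end_headers()
            self.wfile.write(b"hello")

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), RedirectDemo)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/old" % server.server_port
with urllib.request.urlopen(url, timeout=10) as resp:
    body = resp.read()       # b"hello" -- the redirect was followed
    final = resp.geturl()    # ends with "/new", not "/old"

server.shutdown()
```

So if you see an HTTPError on a redirect, the default opener was likely bypassed or the server returned an actual error status.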
Answer 0 (score: 0)
I use a special URLopener to catch redirects:
import urllib

class RedirectException(Exception):
    def __init__(self, errcode, newurl):
        Exception.__init__(self)
        self.errcode = errcode
        self.newurl = newurl

class MyURLopener(urllib.URLopener):
    # Error 301 -- relocated (permanently)
    def http_error_301(self, url, fp, errcode, errmsg, headers, data=None):
        if headers.has_key('location'):
            newurl = headers['location']
        elif headers.has_key('uri'):
            newurl = headers['uri']
        else:
            newurl = "Nowhere"
        raise RedirectException(errcode, newurl)

    # Error 302 -- relocated (temporarily)
    http_error_302 = http_error_301
    # Error 303 -- relocated (see other)
    http_error_303 = http_error_301
    # Error 307 -- relocated (temporarily)
    http_error_307 = http_error_301

urllib._urlopener = MyURLopener()
Now I just need to catch RedirectException and voilà: I know a redirect happened, and I know the target URL. Caveat: I use this code with Python 2.7 and don't know whether it works on Python 3.
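On Python 3, urllib.URLopener no longer lives at that path (it became the deprecated urllib.request.URLopener), but the same "raise on redirect" idea can be expressed by subclassing urllib.request.HTTPRedirectHandler. The sketch below mirrors the exception from the answer above; the handler class name and usage are my own, not from the original answer:

```python
import urllib.request

class RedirectException(Exception):
    def __init__(self, errcode, newurl):
        Exception.__init__(self)
        self.errcode = errcode
        self.newurl = newurl

class RaiseOnRedirect(urllib.request.HTTPRedirectHandler):
    # redirect_request() is called for 301/302/303/307 responses;
    # raising here stops the opener from silently following them.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        raise RedirectException(code, newurl)

opener = urllib.request.build_opener(RaiseOnRedirect)

# Usage (network access assumed):
#     try:
#         opener.open(url, timeout=10)
#     except RedirectException as e:
#         print(e.errcode, e.newurl)
```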
Answer 1 (score: 0)
Using the requests package I found a better solution. The only exceptions you need to handle are:
import requests

try:
    r = requests.get(url, timeout=5)
except requests.exceptions.Timeout:
    pass  # maybe set up for a retry, or continue in a retry loop
except requests.exceptions.TooManyRedirects:
    pass  # tell the user their URL was bad and try a different one
except requests.exceptions.ConnectionError:
    pass  # connection could not be completed
except requests.exceptions.RequestException as e:
    raise SystemExit(e)  # catastrophic error: bail
To get the text of that page, all you need is:
r.text
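If you want to detect a redirect rather than just survive it, requests can also report what happened: after a followed redirect, r.history holds the intermediate responses, and with allow_redirects=False the 3xx response is returned as-is. A small helper sketch (the function name is mine, not part of the requests API):

```python
import requests

def redirect_target(r):
    """Given a response fetched with allow_redirects=False, return the
    Location header if the response is a redirect, else None."""
    if r.is_redirect or r.is_permanent_redirect:
        return r.headers.get("location")
    return None

# Usage (network access assumed):
#     r = requests.get(url, timeout=5, allow_redirects=False)
#     target = redirect_target(r)
```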