Question

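I want to parse data from a large number of web pages (> 10k) using Python, and I find that the function I wrote for this hits a timeout error roughly every 500 iterations. I tried to handle this with a try-except block, but I would like to improve the function so that it retries opening the URL four or five times before returning the error. Is there an elegant way to do this?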

My code is below:

from urllib.error import HTTPError
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def url_open(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        s = urlopen(req, timeout=50).read()
    except HTTPError as e:
        if e.code == 404:
            print(str(e))
        else:
            print(str(e))
            s = urlopen(req, timeout=50).read()
            raise
    return BeautifulSoup(s, "lxml")

Answer 1

I have used a pattern like this for retries in the past:

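import time
from urllib.error import HTTPError
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def url_open(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    retrycount = 0
    s = None
    while s is None:
        try:
            s = urlopen(req, timeout=50).read()
        except HTTPError as e:
            print(str(e))
            if canRetry(e.code):
                retrycount += 1
                if retrycount > 5:
                    # Give up after five retries and re-raise the last error.
                    raise
                time.sleep(1)  # pause briefly before the next attempt
            else:
                # Not a retryable status code: propagate immediately.
                raise

    return BeautifulSoup(s, "lxml")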

You just need to define canRetry somewhere else.
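As a rough illustration of what that might look like, here is a minimal sketch. The set of status codes in RETRYABLE_CODES is my own assumption, not part of the pattern above, so adjust it to whatever your target sites actually return. Note also that a timeout normally surfaces as URLError or socket.timeout rather than HTTPError, so the sketch adds a hypothetical helper, fetch_with_retries, that retries on those as well. Its opener parameter is likewise an assumption, included only so the function can be tried without network access.

```python
import socket
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumption: these status codes are treated as transient; adjust as needed.
RETRYABLE_CODES = {408, 429, 500, 502, 503, 504}


def canRetry(code):
    """Return True if an HTTP status code looks worth retrying."""
    return code in RETRYABLE_CODES


def fetch_with_retries(url, max_attempts=5, timeout=50, delay=1.0, opener=urlopen):
    """Hypothetical helper: return the response body, retrying transient failures.

    `opener` defaults to urlopen and exists only so the function can be
    exercised with a stand-in instead of a real network call.
    """
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(1, max_attempts + 1):
        try:
            return opener(req, timeout=timeout).read()
        except HTTPError as e:
            # The server answered with an error status; retry only transient ones.
            # HTTPError must be caught before URLError, which is its parent class.
            if not canRetry(e.code) or attempt == max_attempts:
                raise
        except (URLError, socket.timeout):
            # Connection problems and timeouts carry no status code.
            if attempt == max_attempts:
                raise
        time.sleep(delay)  # pause briefly before the next attempt
```

With something like this in place, url_open could shrink to building the soup from fetch_with_retries(url). A fixed delay is the simplest choice; for a crawl of this size, increasing the delay after each failed attempt may be kinder to the servers.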

Retrying to open a URL with urllib in Python on timeout
