urllib2 HTTP status check does not work for some links

Time: 2018-11-02 14:14:24

Tags: python urllib2

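I have been playing with this module (urllib2) for a while. Recently I managed to write a simple HTTP status checker that checks the status code received for each URL in a given list and removes the URL if it does not return a 200 OK code.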

The code is as follows:

import urllib2

# urllist is the list of URLs to check; it is defined elsewhere.
# Iterate over a copy so that removing an entry does not skip the next URL.
for p in list(urllist):
    req = urllib2.Request(p)
    try:
        resp = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        # The server answered, but with an error status code.
        if e.code == 404:
            print str(p) + " returns a 404 error (Not Found). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code == 400 or e.code == 401 or e.code == 403:
            print str(p) + " returns a 400 error (Bad Request) or a 401/403 error (Unauthorized/Forbidden). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code == 408:
            print str(p) + " returned a 408 error (Request Timeout). This URL may or may not be available soon, so it will be kept in the list"
        elif e.code == 429:
            print str(p) + " returned a 429 error (Too Many Requests). The script may have reached a request limit, abort and try again later"
        elif 500 <= e.code <= 511:
            print str(p) + " returned a 5xx error (server error). Servers may be unavailable at the moment. Please abort and try again later"
        elif 410 <= e.code <= 451 or e.code > 511:
            print str(p) + " has returned an unspecified HTTP error. This URL will be removed from the list"
            urllist.remove(p)

    except urllib2.URLError as e:
        # No HTTP status was received at all (DNS failure, refused connection, ...).
        print str(p) + " returned an unspecified error. This URL will be removed from the list"
        urllist.remove(p)
    else:
        # 200
        body = resp.read()
        print str(p) + " returns a 200 status code (OK). This URL exists."

The original code is from this post.
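One thing worth checking in a loop like this is what happens when entries are removed from the same list that is being iterated: the remaining items shift left and the next one is skipped. The following is only a minimal sketch with made-up URLs and a hypothetical helper named `keep_reachable`, not part of the original script; it contrasts in-place removal with building a new list.

```python
def keep_reachable(urls, is_ok):
    """Return a new list holding only the URLs for which is_ok(url) is true."""
    return [u for u in urls if is_ok(u)]


# Hypothetical data: pretend the first two URLs returned an error status.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.com/d",
]
bad = set(urls[:2])

# Removing while iterating: after ".../a" is removed, ".../b" slides into
# index 0 and the loop moves on to index 1, so ".../b" is never examined.
mutated = list(urls)
for u in mutated:
    if u in bad:
        mutated.remove(u)

# Building a new list examines every URL exactly once.
filtered = keep_reachable(urls, lambda u: u not in bad)

print(mutated)
print(filtered)
```

Iterating over a copy of the list while removing from the original has the same effect as the filtered version.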

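I tested this with bit.ly URLs, since they are short and not tedious to put into a list. Most of them return one HTTP status code or another, as expected. But some of them take more than 3 times as long to be accepted or removed by the script. One example is bit / 1 / 1da2, which pops up a warning when you visit it.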

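I checked various generated link lists, and the only URLs the script has trouble with are the ones that show a warning. It tries to get the HTTP status code for about 2 minutes (I have not timed it), then jumps to the next URL in the list without removing the link from the list.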

I think this could be solved in the URLError part of this script, but I am not sure.
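A possible direction, offered only as a sketch: `urlopen` accepts a `timeout` argument, and without one it uses the global default socket timeout, which is no timeout unless it has been changed. Assuming the long wait comes from a stalled connection (the question does not confirm this), a timeout can surface either wrapped in `URLError` or as a bare `socket.timeout`, so both are caught below. The function name `fetch_status`, the `opener` parameter, and the 10-second default are hypothetical choices for illustration.

```python
import socket

try:  # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:  # Python 3 equivalents
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def fetch_status(url, timeout=10, opener=urlopen):
    """Return the HTTP status code, or None if no status could be obtained.

    `opener` is injectable so the logic can be exercised without a network.
    """
    try:
        resp = opener(url, timeout=timeout)
    except HTTPError as e:
        # The server replied with an error status (4xx/5xx).
        # HTTPError is a subclass of URLError, so it must be caught first.
        return e.code
    except (URLError, socket.timeout):
        # DNS failure, refused connection, or the timeout expired.
        return None
    try:
        return resp.getcode()
    finally:
        resp.close()
```

With a helper like this, the calling loop could treat `None` as "no answer within the time limit" and decide separately whether such URLs are removed or kept.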

0 answers:

No answers