Question

I want to check whether a certain website exists. This is what I am doing:

import urllib2

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com"
req = urllib2.Request(link, headers=headers)
page = urllib2.urlopen(req).read()  # ERROR 402 generated here!

If the page does not exist (error 402 or any other error), what can I do in the page = ... line to make sure the page I am reading actually exists?

Answer 1

You can use a HEAD request instead of GET. It downloads only the headers, not the content. You can then check the response status from the headers.

import httplib
c = httplib.HTTPConnection('www.example.com')
c.request("HEAD", '/')  # the request path must not be empty
if c.getresponse().status == 200:
    print('web site exists')
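
For Python 3, where httplib was renamed http.client, the HEAD check above might look like the following minimal sketch. The helper names head_status and site_exists, the 5-second timeout, and the choice to treat any status below 400 as "exists" are assumptions made for illustration, not part of the original answer.

```python
import http.client


def head_status(host, path="/", port=80, timeout=5.0):
    """Send a HEAD request and return the status code, or None on failure."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, or a malformed response.
        return None
    finally:
        conn.close()


def site_exists(host, path="/", port=80):
    """Treat any status below 400 as 'exists' (an assumption; adjust as needed)."""
    status = head_status(host, path, port)
    return status is not None and status < 400
```

Calling site_exists('www.example.com') would then return True or False without downloading the page body. For HTTPS sites, http.client.HTTPSConnection would be needed instead.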

Or you can use urllib2:

import urllib2
try:
    urllib2.urlopen('http://www.example.com/some_page')
except urllib2.HTTPError as e:
    print(e.code)
except urllib2.URLError as e:
    print(e.args)

Or you can use requests:

import requests
request = requests.get('http://www.example.com')
if request.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')

Answer 2

It is better to check that the status code is < 400, as was done here. This is what the status codes mean (taken from Wikipedia):

1xx - informational
2xx - success
3xx - redirection
4xx - client error
5xx - server error
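
As a small illustration of the classes listed above, the following sketch maps a status code to its class and applies the "< 400" rule. The function names status_class and looks_reachable are hypothetical, and the lower bound of 100 is an added assumption to reject values that are not valid status codes.

```python
def status_class(code):
    """Return the class name for an HTTP status code, e.g. 404 -> 'client error'."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 yields the leading digit of a three-digit code.
    return classes.get(code // 100, "unknown")


def looks_reachable(code):
    """Apply the '< 400' rule: 1xx, 2xx and 3xx count as reachable."""
    return 100 <= code < 400
```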

If you want to check whether a page exists without downloading the whole page, you should use a HEAD request:

import httplib2
h = httplib2.Http()
resp = h.request("http://www.google.com", 'HEAD')
assert int(resp[0]['status']) < 400

Taken from this answer.

If you want to download the whole page, just make a normal request and check the status code. Example using requests:

import requests

response = requests.get('http://google.com')
assert response.status_code < 400


Hope that helps.

Answer 3

from urllib2 import Request, urlopen, HTTPError, URLError

user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
link = "http://www.abc.com/"
req = Request(link, headers=headers)
try:
    page_open = urlopen(req)
except HTTPError as e:
    print(e.code)    # the server answered with an error status
except URLError as e:
    print(e.reason)  # no response at all (DNS failure, refused connection, ...)
else:
    print('ok')

In response to unutbu's comment:

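Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. Source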
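
In Python 3 the same behaviour applies to urllib.request. The sketch below (the name probe and the 5-second timeout are assumptions for illustration) returns the status of the final response after any redirects, the error code when HTTPError is raised, or None when no HTTP response arrives.

```python
import urllib.request
from urllib.error import HTTPError


def probe(url, timeout=5.0):
    """Return the final HTTP status code, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Redirects were already followed by the default handlers,
            # so this is the status of the final response.
            return response.status
    except HTTPError as e:
        # Raised for 4xx/5xx responses (and redirects that cannot be followed).
        return e.code
    except OSError:
        # URLError is an OSError subclass: DNS failure, refused connection,
        # timeout, and similar cases where no HTTP response arrived.
        return None
```

A redirecting URL would therefore report the status of its target rather than 301 or 302, which matches the quoted explanation.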

Answer 4

Code

import urllib

a = "http://www.example.com"
try:
    print(urllib.urlopen(a))
except IOError:
    # Python 2: urllib.urlopen raises IOError when the connection fails
    print(a + "  site does not exist")

Answer 5

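@Adem Öztaş provided an excellent answer for use with httplib and urllib2. For requests, if the question is strictly about whether a resource exists, that answer can be improved for the case of large resources.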

The previous answer for requests suggested something like the following:

import requests

def uri_exists_get(uri: str) -> bool:
    try:
        response = requests.get(uri)
        try:
            response.raise_for_status()
            return True
        except requests.exceptions.HTTPError:
            return False
    except requests.exceptions.ConnectionError:
        return False

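requests.get attempts to pull the entire resource at once, so for large media files the snippet above will try to pull the entire file into memory. To solve this, we can stream the response.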

def uri_exists_stream(uri: str) -> bool:
    try:
        with requests.get(uri, stream=True) as response:
            try:
                response.raise_for_status()
                return True
            except requests.exceptions.HTTPError:
                return False
    except requests.exceptions.ConnectionError:
        return False

I ran the snippets above, with timers attached, against two web resources:

1) http://bbb3d.renderfarming.net/download.html, a very light HTML page

2) http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4, a decently sized video file

Timing results below:

uri_exists_get("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.611239

uri_exists_stream("http://bbb3d.renderfarming.net/download.html")
# Completed in: 0:00:00.000007

uri_exists_get("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:01:12.813224

uri_exists_stream("http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_30fps_normal.mp4")
# Completed in: 0:00:00.000007
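
The timing harness itself is not shown in this answer. One plausible way to attach a timer, assuming a hypothetical helper named timed, is sketched below; it reports durations as a timedelta, whose default string form usually looks like the H:MM:SS.ffffff values above. The built-in sum stands in for uri_exists_get or uri_exists_stream so the sketch runs without network access.

```python
import time
from datetime import timedelta


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed), where elapsed is a timedelta."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = func(*args, **kwargs)
    elapsed = timedelta(seconds=time.perf_counter() - start)
    return result, elapsed


# Stand-in call; replace sum with uri_exists_get or uri_exists_stream.
result, elapsed = timed(sum, [1, 2, 3])
print("Completed in:", elapsed)
```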

As a last note: this function also works when the resource host does not exist. For example, "http://abcdefghblahblah.com/test.mp4" will return False.

Answer 6

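You can simply use the stream option so that the full file is not downloaded. urllib2 is not available in current Python 3, so it is best to use the well-proven requests library. This simple function will solve your problem.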

import requests

def uri_exists(uri):
    # stream=True defers downloading the body until it is accessed
    r = requests.get(uri, stream=True)
    if r.status_code == 200:
        return True
    else:
        return False

Answer 7

import urllib.request
from urllib.error import HTTPError, URLError

def isok(mypath):
    try:
        thepage = urllib.request.urlopen(mypath)
    except HTTPError as e:
        return 0
    except URLError as e:
        return 0
    else:
        return 1

Answer 8

Try the following:

import urllib2
website = 'https://www.allyourmusic.com'
try:
    response = urllib2.urlopen(website)
    if response.code == 200:
        print("site exists!")
    else:
        print("site doesn't exist!")
except urllib2.HTTPError as e:
    print(e.code)
except urllib2.URLError as e:
    print(e.args)

Python: check whether a website exists

8 answers: