Question

I am downloading a list of remote files. My code looks like this:

try:
    r = requests.get(url, stream=True, verify=False)
    total_length = int(r.headers['Content-Length'])

    if total_length:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()

except (requests.RequestException, StandardError):
    pass

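My problem is that requests downloads plain HTML for files that do not exist (for example a 404 page or some other similar HTML page). Is there a way to avoid this? Is there a header I should check, such as Content-Type?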

Solution:

Following the accepted answer, I used the r.raise_for_status() call, and I also added an extra check on Content-Type, like this:

if r.headers['Content-Type'].split('/')[0] == "text":
    #pass/raise here

(A list of MIME types is available here: http://www.freeformatter.com/mime-types-list.html)
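
The Content-Type check above can be made a little more defensive. The following is only a sketch, not the exact code from the question: the helper name is_text_response is hypothetical, and it assumes headers behaves like a mapping, as r.headers does. It tolerates a missing header and parameters such as "; charset=utf-8".

```python
def is_text_response(headers):
    """Return True if the Content-Type header names a text/* media type.

    `headers` is assumed to be a mapping such as r.headers (hypothetical helper).
    """
    # .get() avoids a KeyError when the server omits the header entirely.
    content_type = headers.get("Content-Type", "")
    # Drop parameters such as "; charset=utf-8" before inspecting the type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # The main type is the part before the slash, e.g. "text" in "text/html".
    return media_type.split("/", 1)[0] == "text"
```

With this helper, the check in the question would read `if is_text_response(r.headers):` followed by whatever skip or raise logic is wanted.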

Answer 1

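Use r.raise_for_status() to raise an exception for responses with 4xx and 5xx status codes, or test r.status_code explicitly.

r.raise_for_status() raises an HTTPError exception, which is a subclass of the RequestException you are already catching: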

try:
    r = requests.get(url, stream=True, verify=False)
    r.raise_for_status()  # raises if not a 2xx or 3xx response
    total_length = int(r.headers['Content-Length'])

    if total_length:
        # etc.    
except (requests.RequestException, StandardError):
    pass
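
To show where the "# etc." part fits, here is a minimal sketch of the whole write step. The function name save_response is hypothetical, and response is assumed to behave like the object returned by requests.get(url, stream=True). Because raise_for_status() runs before the file is opened, an error page never reaches the disk.

```python
def save_response(response, file_name, chunk_size=1024):
    """Write a streamed response body to file_name and return bytes written.

    `response` is assumed to behave like the result of
    requests.get(url, stream=True); the function name is hypothetical.
    """
    # Raises for 4xx/5xx responses before any file is created.
    response.raise_for_status()
    written = 0
    with open(file_name, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written
```

A caller would wrap `save_response(requests.get(url, stream=True), file_name)` in the same try/except as above, so that an HTTPError is handled together with the other request errors.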

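Checking r.status_code lets you narrow down which response codes you consider correct. Note that 3xx redirects are handled automatically, and you will not see other 3xx responses because requests does not send conditional requests in this case, so an explicit test is rarely needed here. But if you do want one, it would look something like this: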

r = requests.get(url, stream=True, verify=False)
r.raise_for_status()  # raises if not a 2xx or 3xx response
total_length = int(r.headers['Content-Length'])

if 200 <= r.status_code < 300 and total_length:
    # etc.
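
The explicit test can also be pulled out into a small function. This is a sketch under assumptions: the name should_save is hypothetical, headers is assumed to be a mapping like r.headers, and a missing or malformed Content-Length is treated as "do not save", which is a design choice rather than something the answer prescribes.

```python
def should_save(status_code, headers):
    """Return True for a 2xx response that declares a non-empty body.

    Hypothetical helper; a missing or malformed Content-Length counts as empty.
    """
    if not 200 <= status_code < 300:
        return False
    try:
        # Default to 0 so an absent header does not raise a KeyError.
        total_length = int(headers.get("Content-Length", 0))
    except ValueError:
        return False
    return total_length > 0
```

Servers that stream a body without a Content-Length header would be skipped by this check, so drop the length test if that case matters.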

Answer 2

if r.status_code == 404:
    handle404()
else:
    download()

Python requests downloads HTML if the file is not found
