Question

Simply take the len() of the response content:

>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671

However, this does not give the accurate content length. For example, look at this Python code:

import sys
import requests

def process_url(url):
    try:
        r = requests.get(url)
        print("Correct Content Length: "+r.headers['Content-Length'])
        print("bytes of r.text       : "+str(sys.getsizeof(r.text)))
        print("bytes of r.content    : "+str(sys.getsizeof(r.content)))
        print("len r.text            : "+str(len(r.text)))
        print("len r.content         : "+str(len(r.content)))
    except Exception as e:
        print(str(e))

# This URL returns a Content-Length header; we use it to check whether the length we calculate is the same.
process_url("https://stackoverflow.com")

If we calculate the content length manually and compare it with the value in the header, we get a much larger number:

Correct Content Length: 51504
bytes of r.text       : 515142
bytes of r.content    : 257623
len r.text            : 257552
len r.content         : 257606

Why doesn't len(r.content) return the correct content length? And how can we calculate it accurately by hand if the header is missing?

Answer 1

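The Content-Length header reflects the size of the response body as it was sent over the wire. That is not the same as the length of the text or content attributes, because the response may be compressed; requests decompresses the response for you.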
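To make the difference concrete, here is a minimal offline sketch. It makes no network request, and the sample body is made up for illustration. It gzip-compresses a payload the way a server might and compares the on-the-wire length with the decompressed length:

```python
import gzip

# Hypothetical HTML body, repeated so that it compresses well.
body = ("<p>Hello, world!</p>\n" * 1000).encode("utf-8")

# What a server sending "Content-Encoding: gzip" would put on the wire.
wire_bytes = gzip.compress(body)

# Content-Length describes the compressed bytes on the wire ...
content_length = len(wire_bytes)
# ... while r.content would hold the decompressed bytes.
decoded_length = len(gzip.decompress(wire_bytes))

print("Content-Length (compressed):", content_length)
print("len(r.content) equivalent  :", decoded_length)
```

The first number is far smaller than the second, which is the same kind of gap seen in the question.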

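You have to bypass a lot of internal plumbing to get at the original, compressed, raw content, and you have to reach into a few more internals if you want the response object to keep working afterwards. The "easiest" way is to enable streaming and then read from the raw socket: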

import requests
from io import BytesIO

url = "https://stackoverflow.com"  # example URL

r = requests.get(url, stream=True)
# read directly from the raw urllib3 connection
raw_content = r.raw.read()
content_length = len(raw_content)
# replace the internal file-object to serve the data again
r.raw._fp = BytesIO(raw_content)

Demo:

>>> import requests
>>> from io import BytesIO
>>> url = "https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding'] # a compressed response
'gzip'
>>> r.headers['Content-Length']   # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type']     # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content)              # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content)    # the decompressed binary content, byte count
258719
>>> len(r.text)       # the Unicode content decoded from UTF-8, character count
258658

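This reads the full response into memory, so don't use it if you expect large responses! In that case, you could use shutil.copyfileobj() to copy the data from the r.raw file into a spooled temporary file (which switches to an on-disk file once it reaches a certain size), get the size of that file, and then put that file in place as r.raw._fp.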
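If only the size is needed and the body itself can be discarded, a simpler variant is to read the raw stream in chunks and count the bytes. The following is a minimal sketch under assumptions: count_raw_bytes is a hypothetical helper, not part of requests, and in real use it would be handed r.raw from a stream=True request. Note that this consumes the stream, so r.content is not usable afterwards. A BytesIO object stands in for r.raw here so that the sketch runs without a network connection:

```python
from io import BytesIO


def count_raw_bytes(fileobj, chunk_size=64 * 1024):
    """Count the bytes readable from a file-like object without storing them.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        total += len(chunk)
    return total


# Stand-in for r.raw (assumption: the real object would come from
# requests.get(url, stream=True).raw).
fake_raw = BytesIO(b"x" * 200_000)
print(count_raw_bytes(fake_raw))
```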

A function that adds a Content-Length header to any response that is missing it would look like this:

import requests
import shutil
import tempfile

def ensure_content_length(
    url, *args, method='GET', session=None, max_size=2**20,  # 1 MiB
    **kwargs
):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r

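This accepts an existing session and lets you specify the request method. Adjust max_size to suit your memory constraints. Demo on https://github.com, which lacks a Content-Length header: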

>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814

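Note that if there is no Content-Encoding header, or the value of that header is set to identity, and Content-Length is available, you can simply rely on Content-Length being the full size of the response. That's because in that case no compression has been applied.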
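That rule can be written down as a small check. This is a sketch only: content_length_is_body_size is a hypothetical helper name, and the header values below are illustrative. With requests you would pass r.headers, which is case-insensitive; the plain dicts used here are case-sensitive, so the keys are spelled exactly.

```python
def content_length_is_body_size(headers):
    """Return True if Content-Length can be taken as the decoded body size.

    Hypothetical helper: true only when the header is present and no
    content encoding (other than identity) was applied.
    """
    if "Content-Length" not in headers:
        return False
    encoding = headers.get("Content-Encoding", "identity").strip().lower()
    return encoding in ("", "identity")


# No Content-Encoding header: the length on the wire is the body length.
print(content_length_is_body_size({"Content-Length": "1234"}))
# A gzip-compressed response: Content-Length counts compressed bytes only.
print(content_length_is_body_size(
    {"Content-Length": "52055", "Content-Encoding": "gzip"}))
```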

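As a side note: you should not use sys.getsizeof() if what you are after is the length of a bytes or str object (the number of bytes or characters in that object). sys.getsizeof() gives you the internal memory footprint of a Python object, which covers more than just the number of bytes or characters in that object. See What is the difference between len() and sys.getsizeof() methods in python?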
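A quick way to see the difference, as a minimal sketch. The exact sys.getsizeof() values depend on the Python implementation and version, so none are shown here:

```python
import sys

data = b"hello"   # a bytes object holding 5 bytes
text = "hello"    # a str object holding 5 characters

# len() reports the number of bytes or characters.
print(len(data), len(text))

# sys.getsizeof() also counts the object's own bookkeeping overhead,
# so it is larger than the payload length.
print(sys.getsizeof(data) > len(data))
print(sys.getsizeof(text) > len(text))
```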

Content-Length header not the same as when manually calculating it?
