Question

I need an analogue of:

urllib2.urlopen(url).read(100)

but for compressed pages, for example:

import gzip
import urllib2
from StringIO import StringIO

request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
buf = StringIO(response.read(100))
gzip.GzipFile(fileobj=buf, mode='r').read()

IOError: CRC check failed 0xd71b7369L != 0x0L

Answer 1

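Try using the zlib library instead. Gzip depends on zlib, but it adds a file-level compression concept along with CRC checking, which does not appear to be what you want here.

See this excellent article on HTTP Compression in python (the article advises against using zlib directly, but you should try both methods and decide based on what you are specifically trying to do and what works best for you), as well as these code snippets from Doug Hellmann, which show how to compress or decompress with zlib.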

Some good reading material:

RFC 1952 - details on the GZIP format
zlib Documentation
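
As a minimal sketch of the zlib route (written for Python 3; zlib.decompressobj is also available in Python 2), a streaming decompressor created with wbits set to 16 + zlib.MAX_WBITS parses the gzip header and returns whatever output the truncated input yields, without needing the trailing CRC. The sample payload and the helper name decompress_prefix are assumptions made for illustration; in the question's code, response.read(100) would take the place of truncated.

```python
import gzip
import zlib


def decompress_prefix(chunk):
    """Decompress as much as possible from the start of a gzip stream."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # A streaming decompressor returns the output available so far and
    # does not fail just because the input ends early.
    return decompressor.decompress(chunk)


# Stand-in for a gzip-encoded HTTP body (hypothetical sample data).
original = b"".join(b"line %d of a sample page\n" % i for i in range(2000))
compressed = gzip.compress(original)

truncated = compressed[:100]  # what response.read(100) would return
partial = decompress_prefix(truncated)
```

Here partial holds the leading bytes of the page that could be recovered from the first 100 compressed bytes; how many bytes that is depends on the data.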

Answer 2

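I don't think this is possible, because you need the complete GZIP file (10+ byte header, body, 8-byte footer) to extract it. Unless you have all of it, you cannot extract it. As your error message indicates, the CRC check failed because the CRC is stored in the footer.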
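
The layout described above can be checked directly. Per RFC 1952, the final 8 bytes hold a CRC-32 of the uncompressed data followed by its length modulo 2^32, both little-endian. The following Python 3 sketch uses a made-up payload and a hypothetical helper read_trailer to show that a stream cut at 100 bytes no longer contains that trailer, so the gzip module cannot finish reading it:

```python
import gzip
import struct
import zlib


def read_trailer(gzip_bytes):
    """Return (crc32, isize) from the last 8 bytes of a gzip member."""
    # Both fields are 32-bit little-endian unsigned integers (RFC 1952).
    return struct.unpack("<II", gzip_bytes[-8:])


# Hypothetical sample data, varied enough to compress to well over 100 bytes.
payload = b"".join(b"row %d\n" % i for i in range(5000))
compressed = gzip.compress(payload)

crc, isize = read_trailer(compressed)
crc_matches = crc == (zlib.crc32(payload) & 0xFFFFFFFF)
size_matches = isize == len(payload) % 2**32

# Cutting the stream early drops the trailer, so reading it to the end fails.
truncated = compressed[:100]
try:
    gzip.decompress(truncated)
    truncated_ok = True
except (EOFError, OSError, zlib.error):
    truncated_ok = False
```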

Answer 3

You can ask the server to send you only the first 100 bytes (using the Range header):

import urllib2

req=urllib2.Request('http://www.python.org/')
#
# Here we request that bytes 0-99 be downloaded.
# The range is inclusive, and starts at 0.
#
req.add_header('Accept-encoding','gzip')
req.add_header('Range','bytes={}-{}'.format(0, 99))
f=urllib2.urlopen(req)
# This shows you the actual bytes that have been downloaded.
content_range=f.headers.get('Content-Range')
print(content_range)
# bytes 0-99/18716
print(repr(f.read()))
# '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtm'
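
A server is allowed to ignore the Range header and send the whole body, so it can be worth inspecting the Content-Range value before trusting the slice. Below is a minimal sketch of such a check; parse_content_range is a hypothetical helper (not part of urllib2), and it handles only the "bytes first-last/total" form printed above.

```python
import re


def parse_content_range(value):
    """Parse 'bytes first-last/total' into a tuple of ints.

    Returns None if the header is missing or not in that form;
    an unknown total ('*') is reported as None.
    """
    if not value:
        return None
    match = re.match(r"^bytes (\d+)-(\d+)/(\d+|\*)$", value.strip())
    if match is None:
        return None
    first, last, total = match.groups()
    return (int(first), int(last), None if total == "*" else int(total))


first, last, total = parse_content_range("bytes 0-99/18716")
received = last - first + 1  # the range is inclusive on both ends
```

A result of None from the helper would suggest that the server did not honor the range and that f.read() may return the full body.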

Loading part of a compressed page with urllib2