Question

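Is there any way in Python to download an entire HTML page and its contents (images, CSS) from a given URL to a local folder, and to update the local HTML file so that it picks up the content locally?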

Answer 1

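You can use the urllib module to download an individual URL, but this will just return the data. It will not parse the HTML and automatically download things like CSS files and images.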

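If you want to download the "whole" page, you will need to parse the HTML and find the other things you need to download. You could use something like Beautiful Soup to parse the HTML you retrieve.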

This question has some sample code that does exactly that.
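
The "parse the HTML and find the other things" step can be sketched as follows. This is only an illustration, and it swaps in the standard library's html.parser for Beautiful Soup so that it runs without extra packages. The names AssetCollector and find_assets and the example.com URLs are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collects absolute URLs of images, stylesheets and scripts."""

    # tag name -> attribute that holds the resource URL
    WANTED = {"img": "src", "link": "href", "script": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.WANTED.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # resolve relative references against the page URL
            self.assets.append(urljoin(self.base_url, value))


def find_assets(html, base_url):
    """Returns the resource URLs referenced by an HTML string."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.assets


sample = (
    '<html><head><link rel="stylesheet" href="/css/site.css"></head>'
    '<body><img src="img/logo.png">'
    '<script src="https://cdn.example.com/app.js"></script></body></html>'
)
for asset in find_assets(sample, "http://www.example.com/dir/page.html"):
    print(asset)
```

Each URL returned this way would still have to be downloaded separately, and the HTML rewritten to point at the local copies.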

Answer 2

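What you're looking for is a mirroring tool. If you want one in Python, PyPI lists spider.py, but I have no experience with it. Others might be better, but I don't know. I use 'wget', which supports getting the CSS and the images. This probably does what you want (quoting from the manual):

Retrieve only one HTML page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.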

wget -p --convert-links http://www.server.com/dir/page.html
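
If the call has to be made from Python, one option is to run wget as a subprocess. A minimal sketch, assuming wget is installed and on PATH. The helper names build_wget_command and mirror_page and the dest_dir parameter are invented for the example, and the -P (directory prefix) flag is an addition to the command above.

```python
import shutil
import subprocess


def build_wget_command(url, dest_dir="."):
    """Builds the argument list for the wget call, plus a target folder."""
    return [
        "wget",
        "-p",                # --page-requisites: also fetch images, CSS, etc.
        "--convert-links",   # rewrite links to point at the local copies
        "-P", dest_dir,      # --directory-prefix: where to save the files
        url,
    ]


def mirror_page(url, dest_dir="."):
    """Runs wget for one page; raises if wget is missing or fails."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(build_wget_command(url, dest_dir), check=True)
```

Calling mirror_page("http://www.server.com/dir/page.html", "mirror") would then save the page and its requisites under a folder named mirror.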

Answer 3

You can use urllib:

import urllib.request

url = "http://stackoverflow.com/"
# FancyURLopener is deprecated; urlopen is the supported way to fetch a URL
with urllib.request.urlopen(url) as f:
    content = f.read()  # raw bytes of the HTML only, no images or CSS

Answer 4

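The function `savePage` below:

- Saves the .html in the current folder.
- Downloads JavaScript, CSS and images based on the tags script, link and img.
- Saves them in a folder with the suffix _files.
- Prints any exceptions to sys.stderr.
- Returns a BeautifulSoup object.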

It uses Python 3+, Requests, BeautifulSoup and standard-library modules. The lxml parser passed to BeautifulSoup must also be installed.

The function savePage receives a url and a pagefilename, the base name under which to save the page.

You can extend or adapt it to suit your needs.

import os, sys
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
import re

def savePage(url, pagefilename='page'):
    def soupfindnSave(pagefolder, tag2find='img', inner='src'):
        """saves on specified `pagefolder` all tag2find objects"""
        if not os.path.exists(pagefolder): # create only once
            os.mkdir(pagefolder)
        for res in soup.find_all(tag2find):   # images, css, etc..
            try:
                if not res.has_attr(inner): # check if inner tag (file object) exists
                    continue # may or may not exist
                # clean special chars from the name but keep the file extension
                name, ext = os.path.splitext(os.path.basename(urlparse(res[inner]).path))
                filename = re.sub(r'\W+', '', name) + ext
                fileurl = urljoin(url, res.get(inner))
                filepath = os.path.join(pagefolder, filename)
                # rename html ref so can move html and folder of files anywhere
                res[inner] = os.path.join(os.path.basename(pagefolder), filename)
                if not os.path.isfile(filepath): # was not downloaded
                    with open(filepath, 'wb') as file:
                        filebin = session.get(fileurl)
                        file.write(filebin.content)
            except Exception as exc:
                print(exc, file=sys.stderr)
        return soup
    
    session = requests.Session()
    #... whatever other requests config you need here
    response = session.get(url)
    soup = BeautifulSoup(response.text, features="lxml")
    pagefolder = pagefilename+'_files' # page contents
    soup = soupfindnSave(pagefolder, 'img', 'src')
    soup = soupfindnSave(pagefolder, 'link', 'href')
    soup = soupfindnSave(pagefolder, 'script', 'src')
    with open(pagefilename+'.html', 'wb') as file:
        file.write(soup.prettify('utf-8'))
    return soup

Example: saving google.com as google.html, with its contents in the google_files folder (both in the current folder).

soup = savePage('https://www.google.com', 'google')
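
The step that makes the saved page portable is the mapping from each remote reference to a path relative to the .html file. Isolated as a pure function, it might look like the sketch below. The name local_reference is made up and is not part of the answer. Dropping the query string and always joining with forward slashes are assumptions that go beyond the code above.

```python
import os
import posixpath
import re
from urllib.parse import urljoin, urlparse


def local_reference(page_url, ref, pagefolder):
    """Returns (absolute download URL, relative path to write into the HTML)."""
    fileurl = urljoin(page_url, ref)
    # take the file name from the URL path only, ignoring any ?query part
    name, ext = os.path.splitext(posixpath.basename(urlparse(fileurl).path))
    filename = re.sub(r"\W+", "", name) + ext
    # forward slashes keep the reference valid in HTML on every platform
    return fileurl, posixpath.join(posixpath.basename(pagefolder), filename)


fileurl, local = local_reference(
    "https://www.example.com/a/index.html", "../img/my-logo.png?v=3", "google_files"
)
print(fileurl)  # https://www.example.com/img/my-logo.png?v=3
print(local)    # google_files/mylogo.png
```

Because the reference is relative to the folder name only, the .html file and its _files folder can be moved together to any location.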
