Question

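I have a problem with my task and a question about it. I have written some applications with GruntJS, and I need to download a web page's source code through GruntJS.

For example, I have a page: example.com/index.html.

I would like to supply the URL in the Grunt task, like this: src: "example.com/index.html".

Then I need that source code saved in a file, e.g. source.txt.

How can I do this?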

Answer 1

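There are a few ways to do this.

The first is the raw http.get from the node.js API, as mentioned in the comments. This gives you the raw source that the page serves on its initial load. The problem comes when the site relies heavily on JavaScript to build more HTML after AJAX requests, because none of that generated markup will be in the response.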

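The second way is to use a real browser engine to load the site, so that any JavaScript and further HTML building that happens on page load actually runs. The most common engine is PhantomJS, which is wrapped in a Grunt library called grunt-lib-phantomjs.

Luckily, someone has provided another layer on top of that which does almost exactly what you are asking for: https://github.com/cburgdorf/grunt-html-snapshot

Example config from the link above: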

grunt.initConfig({
    htmlSnapshot: {
        all: {
          options: {
            //that's the path where the snapshots should be placed
            //it's empty by default which means they will go into the directory
            //where your Gruntfile.js is placed
            snapshotPath: 'snapshots/',
            //This should be either the base path to your index.html file
            //or your base URL. Currently the task does not use its own
            //webserver. So if your site needs a webserver to be fully
            //functional configure it here.
            sitePath: 'http://localhost:8888/my-website/',
            //you can choose a prefix for your snapshots
            //by default it's 'snapshot_'
            fileNamePrefix: 'sp_',
            //by default the task waits 500ms before fetching the html.
            //this is to give the page enough time to assemble itself.
            //if your page needs more time, tweak here.
            msWaitForPages: 1000,
            //if you would rather not keep the script tags in the html snapshots
            //set `removeScripts` to true. It's false by default
            removeScripts: true,
            //here goes the list of all urls that should be fetched
            urls: [
              '',
              '#!/en-gb/showcase'
            ]
          }
        }
    }
});
