Question

In my CustomDownloaderMiddleware:

    def process_request(self, request, spider):
        if spider.name == 'UrlSpider':
            res = requests.get(request.url)
            return HtmlResponse(request.url, body=res.content, encoding='utf-8', request=request)

I want to render response.body in def process_response. How can I do that?

Answer 1

There is a scrapy middleware that does this: it runs your requests through PhantomJS, and your responses will contain the rendered HTML.

You can find it here, and it has worked quite well for me (although, according to its author, it is not well tested): https://github.com/brandicted/scrapy-webdriver

If you are not tied to PhantomJS, you could also take a look at https://github.com/scrapy-plugins/scrapy-splash, since it is better maintained (by the people who develop scrapy).
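For orientation, here is a minimal settings.py sketch for scrapy-splash, following the configuration described in that project's README. The SPLASH_URL value is an assumption (a Splash instance listening locally on port 8050); adjust it to wherever Splash actually runs.

```python
# settings.py (sketch): enable scrapy-splash.
# Assumption: a Splash instance is reachable at this address.
SPLASH_URL = 'http://localhost:8050'

# Middlewares that route marked requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Keeps duplicate Splash arguments from bloating the request queue.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With these settings in place, only the requests you create as scrapy_splash.SplashRequest (for example with args={'wait': 0.5}) should be rendered; ordinary Request objects keep going through the normal downloader.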

Update

If you want to fetch only some pages through PhantomJS, I see two possible approaches:

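1. It is most likely possible to do some JavaScript magic to inject the HTML from your response.body into PhantomJS and have it render the page. This would be exactly what you want, but it might be tricky to get working. (I have done some JavaScript magic with PhantomJS, and it is generally not as easy as I would hope.)

2. You can register the PhantomJS downloader alongside the standard downloader, and then load the pages you want rendered a second time, this time through the PhantomJS downloader.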

To do this, put the following in settings.py:

    # note the 'js-' in front of http
    DOWNLOAD_HANDLERS = {
        'js-http': 'scrapy_webdriver.download.WebdriverDownloadHandler',
        'js-https': 'scrapy_webdriver.download.WebdriverDownloadHandler',
    }

Then in your parse method:

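    def parse(self, response):
        # should_be_rendered is your own check for pages that need JavaScript
        if should_be_rendered(response):
            # rewrite only the scheme (count=1), not every "http" in the URL
            phantom_url = response.url.replace("http", "js-http", 1)
            # do the same request again but this time through the WebdriverDownloadHandler
            yield Request(phantom_url, ...)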
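The scheme rewrite can be pulled out into a small helper. The sketch below is only an illustration: to_js_url is a hypothetical name, not part of scrapy or scrapy-webdriver. It prefixes the URL only when it starts with an http or https scheme, so an "http" that appears later in the URL (for example in a query parameter) is left alone, which a str.replace without a count would not guarantee.

```python
def to_js_url(url):
    """Prefix an http(s) URL with 'js-' so that it matches the
    'js-http' / 'js-https' keys registered in DOWNLOAD_HANDLERS.

    Hypothetical helper for illustration only.
    """
    if url.startswith(("http://", "https://")):
        return "js-" + url
    # Already rewritten, or not an HTTP URL: leave it unchanged.
    return url


url = "http://example.com/?next=http://a.org"

# A replace without a count also rewrites the URL inside the query string:
naive = url.replace("http", "js-http")
# naive == "js-http://example.com/?next=js-http://a.org"

# The helper touches only the leading scheme:
safe = to_js_url(url)
# safe == "js-http://example.com/?next=http://a.org"
```

In the parse method this would be used as yield Request(to_js_url(response.url), ...), with the remaining arguments as in your original request.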

Is it possible for scrapy to render the downloaded page source directly with PhantomJS?
