Question

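I am using scrapy-splash to build my spider. I need to maintain a session, so I use scrapy.downloadermiddlewares.cookies.CookiesMiddleware, which handles the Set-Cookie header. I know it handles the Set-Cookie header because I set COOKIES_DEBUG = True, which makes CookiesMiddleware print output about the Set-Cookie headers.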

Problem: when I add Splash to the picture, the Set-Cookie printouts disappear, and the response headers I actually get are {'Date': ['Sun, 25 Sep 2016 12:09:55 GMT'], 'Content-Type': ['text/html; charset=utf-8'], 'Server': ['TwistedWeb/16.1.1']}. This is related to the Splash rendering engine, which uses TwistedWeb.

Is there any directive that tells Splash to also give me the original response headers?

Answer 1

To get the original response headers, you can write a Splash Lua script; see the examples in the scrapy-splash README:

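Use a Lua script to get an HTML response with cookies, headers, body and method set to the correct values; the lua_source argument value is cached on the Splash server and is not sent with each request (this requires Splash 2.1+):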

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

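class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder name, not from the original

    # ...
    def start_requests(self):
        # the original elides where `url` comes from; start_urls is assumed here
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_result,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script},
                headers={'X-My-Header': 'value'},
            )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
        self.logger.debug('Set-Cookie: %s', response.headers.getlist('Set-Cookie'))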
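
The headers that the Lua script returns come from a splash:history() entry, which follows the HAR format, so they arrive as a list of name/value pairs rather than a plain mapping. The sketch below only illustrates that shape and how repeated Set-Cookie headers could be grouped; har_headers_to_dict and the sample data are hypothetical and are not part of scrapy-splash, which does its own conversion when it fills response.headers.

```python
from collections import defaultdict


def har_headers_to_dict(har_headers):
    """Group HAR-style [{'name': ..., 'value': ...}] headers by lowercase name."""
    grouped = defaultdict(list)
    for header in har_headers:
        # Header names are case-insensitive, so normalize them before grouping.
        grouped[header["name"].lower()].append(header["value"])
    return dict(grouped)


# Hypothetical sample, shaped like the `response.headers` array of a HAR entry.
sample = [
    {"name": "Content-Type", "value": "text/html; charset=utf-8"},
    {"name": "Set-Cookie", "value": "sessionid=abc123; Path=/"},
    {"name": "Set-Cookie", "value": "csrftoken=xyz789; Path=/"},
]

headers = har_headers_to_dict(sample)
set_cookies = headers.get("set-cookie", [])
```

Keeping every value in a list matters here because a site may send several Set-Cookie headers in one response, and collapsing them into a single string would lose cookies.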

scrapy-splash also provides built-in helpers for cookie handling; as described in the README, they are enabled in this example as long as scrapy-splash is configured.
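
For reference, "configured" here refers to the project settings described in the scrapy-splash README. A minimal settings.py sketch along those lines might look as follows; the SPLASH_URL value assumes a Splash instance running locally on port 8050, so adjust it to your setup and check the README for the version you have installed.

```python
# settings.py (sketch based on the scrapy-splash README)

# Assumption: Splash is reachable locally on its default port.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    # Cookie helper; it is listed with a lower number than SplashMiddleware.
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    # Needed for cache_args (e.g. caching lua_source on the Splash server).
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make request fingerprints and the HTTP cache aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

With SplashCookiesMiddleware enabled and the Lua script returning a cookies field, the cookies collected in Splash are merged back into Scrapy's cookiejar, which is what keeps the session alive across requests.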

scrapy-splash returns its own headers rather than the site's original headers
