Running Scrapy Splash with a proxy

Time: 2018-11-03 15:10:10

Tags: proxy scrapy splash

I am using a proxy server with Splash, but I keep getting 502 responses, and this has been troubling me for days.

My downloader middleware:

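import base64

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

class ABProxyMiddleware(HttpProxyMiddleware):
    """ Abuyun IP proxy configuration """
    # `settings` is the project settings object; its import is not shown in the original post.
    proxyAuth = "Basic " + base64.urlsafe_b64encode(
        bytes((settings['PROXY_USER'] + ":" + settings['PROXY_PASS']), "ascii")).decode("utf-8")

    def process_request(self, request, spider):
        request.meta['splash']['args']['proxy'] = settings['PROXY_SERVER']
        request.headers['Proxy-Authorization'] = self.proxyAuth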

My request:

yield SplashRequest(
    url='http://www.qidian.com/all?chanId=4&subCateId=130&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=' + str(i),
    callback=self.book_parse,
    endpoint='render.html')

My settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'tempScrapy.middlewares.ABProxyMiddleware': 100,
}

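I am sure that all of the proxy settings are correct and that the proxy itself works, because requests through it succeed when Splash is not involved.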

1 answer:

Answer 0: (score: 0)

According to your code, you are sending the proxy authentication header to the Splash server:

+-------------+
| Your spider |
+------+------+
       |
       | Proxy Authentication
       v
+------+-------+
|   Splash     |
+------+-------+
       |
       |
       v
+------+-------+
| Proxy server |
+------+-------+
       |
       |
       v
+------+-------+
| Target site  |
+--------------+

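The Splash server simply ignores the proxy authentication header you send, so the proxy server rejects your request because authentication fails.

The correct approach is to have Splash send the proxy authentication header: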

+-------------+
| Your spider |
+------+------+
       |
       |
       v
+------+-------+
|   Splash     |
+------+-------+
       |
       | Proxy Authentication
       v
+------+-------+
| Proxy server |
+------+-------+
       |
       |
       v
+------+-------+
| Target site  |
+--------------+

So you need to remove the following line:

request.headers['Proxy-Authorization'] = self.proxyAuth

and configure the proxy information correctly:

request.meta['splash']['args']['proxy'] = 'proxy info of format: [protocol://][user:password@]proxyhost[:port]'
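
To make the two changes concrete, here is a minimal sketch of what the corrected middleware might look like. It is not taken from the question or from the Splash docs: `build_proxy_url`, `SplashProxyMiddleware`, and the `proxy.example.com` host are hypothetical names used for illustration, and the sketch assumes the credentials contain no characters (such as `@` or `:`) that would need escaping inside a URL.

```python
def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Build a string of the form [protocol://][user:password@]proxyhost[:port].

    Hypothetical helper; assumes user and password need no URL escaping.
    """
    auth = ""
    if user is not None and password is not None:
        auth = "{}:{}@".format(user, password)
    return "{}://{}{}:{}".format(scheme, auth, host, port)


class SplashProxyMiddleware:
    """Sketch of a downloader middleware that hands the proxy to Splash.

    No Proxy-Authorization header is set on the request to Splash;
    the credentials travel inside the `proxy` argument instead.
    """

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        splash = request.meta.get("splash")
        if splash is None:
            # Not a Splash request, so leave it untouched.
            return None
        # Create the `args` dict if it is missing, then set the proxy.
        splash.setdefault("args", {})["proxy"] = self.proxy_url
        return None


# Placeholder values; substitute your own proxy host and credentials.
proxy_url = build_proxy_url("proxy.example.com", 9020, "user", "secret")
```

In a real project the middleware would typically read the host and credentials from the Scrapy settings (for example the `PROXY_SERVER`, `PROXY_USER`, and `PROXY_PASS` keys from the question) and would still need to run before `scrapy_splash.SplashMiddleware`, as the priority of 100 in the question's settings already arranges.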

See also: the API reference of Splash (look for the proxy parameter)