Unable to get the desired result using a try/except clause within scrapy

Time: 2019-04-29 16:36:51

Tags: python python-3.x web-scraping scrapy

I've written a script in scrapy to make proxied requests using newly generated proxies from the get_proxies() method. I use the requests module to fetch the proxies so that I can reuse them in the script. What I'm trying to do is parse all the movie links from the landing page and then fetch the name of each movie from its target page. My script below can rotate proxies.

I know there is an easier way to change proxies, as described here for HttpProxyMiddleware, but I still want to stick with the approach I'm trying here.
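For reference, a minimal sketch of that built-in route (an assumption about which variant the linked description covers): HttpProxyMiddleware, which is enabled by default, also honors the standard proxy environment variables, so a single fixed proxy can be configured without touching the spider at all:

import os

# HttpProxyMiddleware (enabled by default) reads these environment variables
# at startup and fills request.meta['proxy'] accordingly.
os.environ["http_proxy"] = "http://142.93.127.126:3128"   # example proxy taken from the question
os.environ["https_proxy"] = "http://142.93.127.126:3128"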

website link

This is my current attempt (it keeps using new proxies to try to fetch a valid response, but every time it gets 503 Service Unavailable):

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_proxies():   
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text,"lxml")
    proxy = [':'.join([item.select_one("td").text,item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxy

class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503]
    proxy_vault = get_proxies()
    check_url = "https://yts.am/browse-movies"

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url,callback=self.parse,dont_filter=True)
        request.meta['https_proxy'] = f'http://{proxy_url}'
        yield request

    def parse(self,response):
        print(response.meta)
        if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get():
            random.shuffle(self.proxy_vault)
            proxy_url = next(cycle(self.proxy_vault))
            request = scrapy.Request(self.check_url,callback=self.parse,dont_filter=True)
            request.meta['https_proxy'] = f'http://{proxy_url}'
            yield request

        else:
            for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                nlink = response.urljoin(item)
                yield scrapy.Request(nlink,callback=self.parse_details)

    def parse_details(self,response):
        name = response.css("#movie-info h1::text").get()
        yield {"Name":name}

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT':'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()

To make sure the requests are being proxied, I printed response.meta and got results like {'https_proxy': 'http://142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.

As I have overused the link while checking how proxied requests work within scrapy, I'm getting the 503 Service Unavailable error at the moment, and I can see the keyword DDoS protection by Cloudflare in the response. However, when I try to apply the same logic with the requests module, I get a valid response.
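For comparison, here is a minimal sketch of that requests-based check (the rotation loop, headers, and timeout are assumptions based on the description, not the asker's exact code):

import random
import requests

check_url = "https://yts.am/browse-movies"
proxies = get_proxies()          # same helper as in the spider above
random.shuffle(proxies)

for proxy in proxies:
    try:
        # requests takes a per-scheme proxy mapping
        r = requests.get(check_url,
                         proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
                         headers={"User-Agent": "Mozilla/5.0"},
                         timeout=10)
        if r.status_code == 200:
            print(f"valid response via {proxy}")
            break
    except requests.exceptions.RequestException:
        continue                 # connection error: try the next proxy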

My earlier question: why can't I get a valid response when (I suppose) I'm using the proxies the right way? [solved]

  

Bounty question: how can I define a try/except clause within the script so that it will try a different proxy once it encounters a connection error with a certain proxy?

2 Answers:

Answer 0 (score: 3):

According to the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware docs (and source), you should use the proxy meta key (not https_proxy):

#request.meta['https_proxy'] = f'http://{proxy_url}'  
request.meta['proxy'] = f'http://{proxy_url}'

Because scrapy never received a valid meta key, your scrapy application was not using a proxy at all.
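Applied to the asker's start_requests, the fix is a one-line change (a sketch only; the rest of the spider stays as posted above):

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
        # 'proxy' is the meta key HttpProxyMiddleware actually reads
        request.meta['proxy'] = f'http://{proxy_url}'
        yield request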

Answer 1 (score: 2):

The start_requests() function is just the entry point. On subsequent requests you need to supply this metadata to the Request object again.

Also, errors can occur on two levels: the proxy server and the target server.

We need to handle bad response codes from both of them. Proxy errors are returned by the middleware to the errback function; the target server's response can be handled during parsing based on response.status:
import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


def get_proxies():
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in
             soup.select("table.table tbody tr") if "yes" in item.text]
    # proxy = ['https://52.0.0.1:8090', 'https://52.0.0.2:8090']
    return proxy


def get_random_proxy(proxy_vault):
    random.shuffle(proxy_vault)
    proxy_url = next(cycle(proxy_vault))
    return proxy_url


class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503, 502, 401, 403]
    check_url = "https://yts.am/browse-movies"
    proxy_vault = get_proxies()

    def handle_middleware_errors(self, failure):
        # implement middleware (proxy-level) error handling here
        print('Middleware Error')
        # retry the failed request with a different proxy
        yield self.make_request(url=failure.request.url, callback=failure.request.meta['callback'])

    def start_requests(self):
        yield self.make_request(url=self.check_url, callback=self.parse)

    def make_request(self, url, callback, dont_filter=True):
        return scrapy.Request(url,
                              meta={'proxy': f'https://{get_random_proxy(self.proxy_vault)}', 'callback': callback},
                              callback=callback,
                              dont_filter=dont_filter,
                              errback=self.handle_middleware_errors)

    def parse(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # implement server status code handling here - this loops forever
                print(f'Status code: {response.status}')
                raise
            else:
                for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                    nlink = response.urljoin(item)
                    yield self.make_request(url=nlink, callback=self.parse_details)
        except:
            # if anything goes wrong fetching the lister page, try again
            yield self.make_request(url=self.check_url, callback=self.parse)

    def parse_details(self, response):
        print(response.meta)
        try:
            if response.status != 200:
                # implement server status code handeling here - this loops forever
                print(f'Status code: {response.status}')
                raise
            name = response.css("#movie-info h1::text").get()
            yield {"Name": name}
        except:
            # if anything goes wrong fetching the detail page, try again
            yield self.make_request(url=response.request.url, callback=self.parse_details)


if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()