Proxy pool system for Scrapy that temporarily stops using slow/timed-out proxies

Date: 2018-02-21 16:35:16

Tags: python proxy scrapy

I have been looking around for a suitable proxy pooling system for Scrapy, but I cannot find anything that does what I need/want.

I am looking for a solution that offers:

Rotating proxies

  • I want it to switch between proxies at random, but never pick the same proxy twice in a row. (Scrapoxy has this)
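A minimal sketch of that rotation rule, assuming the pool is a plain Python list. The function name `pick_proxy` is made up for illustration and is not part of Scrapy or Scrapoxy:

```python
import random


def pick_proxy(proxies, previous=None):
    """Return a random proxy that differs from `previous` when possible."""
    # Exclude the proxy used for the previous request.
    candidates = [p for p in proxies if p != previous]
    # With a single-proxy pool there is no alternative, so fall back to it.
    # An empty pool raises IndexError, which the caller has to handle.
    return random.choice(candidates or list(proxies))
```

Calling `pick_proxy(pool, previous)` and storing the result as the new `previous` gives random rotation without immediate repeats, as long as the pool holds at least two proxies.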

Imitating known browsers

  • Imitate Chrome, Firefox, Internet Explorer, Edge, Safari, etc. (Scrapoxy has this)

Blacklisting slow proxies

  • If a proxy times out or is slow, it should be blacklisted according to a set of rules... (Scrapoxy only blacklists based on the number of instances/startups)

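  • If a proxy is slow (takes longer than x), it should be marked as Slow, a timestamp should be recorded, and a counter should be incremented.
  • If a proxy times out, it should be marked as Fail, a timestamp should be recorded, and a counter should be incremented.
  • If a proxy has no slows for 15 minutes after its last slow, the counter and timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy has no fails for 30 minutes after its last fail, the counter and timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy is slow 5 times in 1 hour, it should be removed from the pool for 1 hour.
  • If a proxy times out 5 times in 1 hour, it should be blacklisted for 1 hour.
  • If a proxy gets blocked twice in 3 hours, it should be blacklisted for 12 hours and marked as bad.
  • If a proxy gets marked as bad twice in 48 hours, it should notify me (email, Pushbullet... anything).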
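To make the rules above concrete, here is a rough sketch of per-proxy bookkeeping. Everything in it is an assumption made for illustration: the class name `ProxyHealth`, its attributes, and the reading of "blocked twice in 3 hours" as "hit a 1-hour ban twice in 3 hours". It is not an existing Scrapy or Scrapoxy API, and the notification itself is left as a flag for the caller to act on.

```python
import time

MINUTE = 60
HOUR = 60 * MINUTE


class ProxyHealth:
    """Tracks slow/fail events for one proxy (hypothetical helper)."""

    def __init__(self):
        self.slow_times = []   # timestamps of recent slow responses
        self.fail_times = []   # timestamps of recent timeouts
        self.ban_times = []    # timestamps of recent 1-hour bans
        self.bad_times = []    # timestamps at which the proxy was marked bad
        self.banned_until = 0.0
        self.needs_notification = False

    def record_slow(self, now=None):
        # Slow counter resets after 15 quiet minutes.
        self._record(self.slow_times, 15 * MINUTE, now)

    def record_fail(self, now=None):
        # Fail counter resets after 30 quiet minutes.
        self._record(self.fail_times, 30 * MINUTE, now)

    def is_available(self, now=None):
        now = time.time() if now is None else now
        return now >= self.banned_until

    def _record(self, events, quiet_period, now):
        now = time.time() if now is None else now
        # A long enough quiet period since the last event resets the counter.
        if events and now - events[-1] > quiet_period:
            del events[:]
        events.append(now)
        # Only events from the last hour count towards a ban.
        events[:] = [t for t in events if now - t <= HOUR]
        if len(events) >= 5:
            del events[:]
            self._ban(now)

    def _ban(self, now):
        self.ban_times = [t for t in self.ban_times if now - t <= 3 * HOUR]
        self.ban_times.append(now)
        if len(self.ban_times) < 2:
            self.banned_until = now + HOUR
            return
        # Second ban within 3 hours: 12-hour blacklist, marked as bad.
        self.ban_times = []
        self.banned_until = now + 12 * HOUR
        self.bad_times = [t for t in self.bad_times if now - t <= 48 * HOUR]
        self.bad_times.append(now)
        if len(self.bad_times) >= 2:
            self.needs_notification = True  # caller sends the email/push
```

The `now` argument exists only so the rules can be exercised with fixed timestamps; in real use you would call `record_slow()` or `record_fail()` without it and skip any proxy whose `is_available()` returns False.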

Does anyone know of any such solution (the main feature being the blacklisting of slow/timed-out proxies)?

1 Answer:

Answer 0: (score: 1)

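Since your pooling rules are very specific, you could write your own code. The code below implements some parts of your rules (you will have to implement the other parts yourself):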

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import pexpect,time
from random import shuffle

#test a single proxy by opening a telnet connection to it (requires the telnet binary)
def test_proxy(ip,port,max_timeout=1):
    time_send_request=time.time()
    child = pexpect.spawn("telnet " + ip + " " +str(port))
    time_to_answer=max_timeout
    try:
        i=child.expect(["Connected to","Connection refused"], timeout=max_timeout) #max timeout in seconds
        time_to_answer=time.time()-time_send_request
    except (pexpect.TIMEOUT, pexpect.EOF):
        #no answer in time, or telnet exited before printing either pattern
        i=-1
    finally:
        child.close(force=True) #do not leave telnet processes running
    if i==0:
        return {"status":True,"time_to_answer":time_to_answer}
    else:
        return {"status":False,"time_to_answer":max_timeout}


#test every proxy in the list, record its status and apply your custom rules
def update_proxy_list_status(proxy_list):
    for i in range(0,len(proxy_list)):
        print ("testing proxy "+str(i)+" "+proxy_list[i]["ip"]+":"+str(proxy_list[i]["port"]))
        proxy_status = test_proxy(proxy_list[i]["ip"],proxy_list[i]["port"])
        proxy_list[i]["status_ok"]= proxy_status["status"]

        print (proxy_status)

        #this is the place to apply your own rules and update the respective proxy dict

        #~ If a proxy is slow (takes longer than x) it should be marked as Slow, a timestamp should be recorded and a counter should be incremented.
        #~ If a proxy times out it should be marked as Fail, a timestamp should be recorded and a counter should be incremented.
        #~ If a proxy has no slows for 15 minutes after its last slow, the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy has no fails for 30 minutes after its last fail, the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour it should be removed from the pool for 1 hour.
        #~ If a proxy times out 5 times in 1 hour it should be blacklisted for 1 hour.
        #~ If a proxy gets blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad.
        #~ If a proxy gets marked as bad twice in 48 hours it should notify me (email, Pushbullet... anything).

        if proxy_status["status"]:
            #proxy answered: update its dict with your own rules (timestamp, last check time, last up, etc.)
            #...
            pass
        else:
            #proxy failed or timed out: update its dict with your own rules (timestamp, last check time, last down, etc.)
            #...
            pass

    return proxy_list


#select a good proxy and do the job with it
def main():

    #first populate a proxy list | these example proxies come from http://free-proxy.cz/en/
    proxy_list=[
        {"ip":"167.99.2.12","port":8080}, #bad proxy
        {"ip":"167.99.2.17","port":8080},
        {"ip":"66.70.160.171","port":1080},
        {"ip":"192.99.220.151","port":8080},
        {"ip":"142.44.137.222","port":80}
        # [...]
    ]



    #keeps track of the last used proxy (to avoid using the same one twice in a row)
    previous_proxy_ip=""

    the_job=True
    while the_job:

        #here we update each proxy status
        proxy_list = update_proxy_list_status(proxy_list)

        #keep only the proxies considered ok
        good_proxy_list = [d for d in proxy_list if d['status_ok']]

        #here you can shuffle the list
        shuffle(good_proxy_list)

        #select a proxy different from the previous one
        #(current_proxy stays empty if no other good proxy is available)
        current_proxy={}
        for candidate in good_proxy_list:
            if candidate["ip"]!=previous_proxy_ip:
                previous_proxy_ip=candidate["ip"]
                current_proxy=candidate
                break

        #use this selected proxy to do the job
        print ("the current proxy is: "+str(current_proxy))

        #UPDATE SCRAPY PROXY

        #DO THE SCRAPY JOB
        print ("DO MY SCRAPY JOB with the current proxy settings")

        #wait a few seconds
        time.sleep(5)

if __name__ == "__main__":
    main()
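For the "UPDATE SCRAPY PROXY" step, one possible approach is a downloader middleware that sets `request.meta["proxy"]`, which is the key Scrapy's built-in HttpProxyMiddleware reads. The sketch below is only an illustration: the class name `RandomProxyMiddleware` and the way it receives its proxy list are assumptions, and the addresses in the usage note are placeholders.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that rotates proxies per request."""

    def __init__(self, proxies):
        # Each proxy is a dict with "ip" and "port", as in the answer's list.
        self.proxies = list(proxies)
        self.previous = None

    def process_request(self, request, spider):
        # Avoid the proxy used for the previous request when possible.
        candidates = [p for p in self.proxies if p != self.previous]
        proxy = random.choice(candidates or self.proxies)
        self.previous = proxy
        # Scrapy routes the request through the URL stored under this key.
        request.meta["proxy"] = "http://%s:%d" % (proxy["ip"], proxy["port"])
        return None  # let Scrapy continue processing the request
```

For example, `RandomProxyMiddleware([{"ip": "10.0.0.1", "port": 8080}, {"ip": "10.0.0.2", "port": 3128}])` alternates between the two placeholder proxies. To use something like this in a real project, the class would also need a way to be constructed by Scrapy (for instance a `from_crawler` classmethod) and an entry in the `DOWNLOADER_MIDDLEWARES` setting; filtering out blacklisted proxies would happen where `candidates` is built.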