Restarting a Scrapy spider after a failed login

Date: 2018-01-09 15:33:19

Tags: python scrapy

I have a spider with an __init__ method:

class ExpireddomainsSpider(InitSpider):
    name = "expiredomains"

    def __init__(self, typ=None, *args, **kwargs):
        super(ExpireddomainsSpider, self).__init__(*args, **kwargs)
        self.typ = typ

        # Pick a random "user:password:ip:port:user-agent" line from users.txt
        with open("users.txt", "r") as users:
            dane = self.random_line(users)
        dane = dane.split(':')
        self.user = dane[0]
        self.password = dane[1]
        self.ip = dane[2]
        self.port = dane[3]
        self.headers = {"User-Agent": dane[4]}
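(The random_line helper used above isn't shown in the question. A minimal sketch of such a helper, assuming it just returns one uniformly random line, could use reservoir sampling so the file never has to be read twice; this is my assumption, not the asker's code:)

import random

def random_line(self, afile):
    # Reservoir sampling: replace the kept line with probability 1/n,
    # so every line of the file is equally likely to be chosen.
    line = next(afile)
    for num, aline in enumerate(afile, 2):
        if random.randrange(num) == 0:
            line = aline
    return line.strip()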

So the spider picks a random line from a text file, where I keep the user login, password, and so on. Then I have a login function:

def login(self, response):
    self.log("USER: " + self.user + " PASS: " + self.password)
    return FormRequest('https://member.expireddomains.net/login/',
                       formdata={'login': self.user, 'password': self.password},
                       callback=self.check_login_response, method='POST')
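(For context, since the question doesn't show it: with an InitSpider, the login is normally kicked off from init_request, and self.initialized() hands control back to the regular crawl once the login has been verified. A minimal sketch of that standard wiring, with the spider name and start_urls as placeholder assumptions:)

from scrapy.spiders.init import InitSpider
from scrapy.http import FormRequest

class LoginFirstSpider(InitSpider):  # hypothetical name, for illustration
    name = "login_first"
    start_urls = ["https://www.expireddomains.net/"]

    def init_request(self):
        # Runs before any start_urls request is scheduled
        return FormRequest('https://member.expireddomains.net/login/',
                           formdata={'login': self.user, 'password': self.password},
                           callback=self.check_login_response)

    def check_login_response(self, response):
        if self.user.title() in response.text:
            return self.initialized()  # resume the normal crawl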

And a function that checks whether the login succeeded:

def check_login_response(self, response):
    sprawdz = self.user.title()
    # response.text is unicode; response.body would be bytes under Python 3
    if sprawdz in response.text:
        self.log("Successfully logged in. Let's start crawling!")
        return scrapy.FormRequest('http://expireddomains.net/', callback=self.start_crawl)
    elif "Your account was disabled" in response.text:
        self.log("Your account was disabled!")
        super(ExpireddomainsSpider, self).__init__()
    else:
        self.log("Bad times :(")

If the login fails, I want to restart my spider, so that it opens the users file again, gets another random line, and tries to log in once more.

I tried:

super(ExpireddomainsSpider, self).__init__()

But it doesn't work; the spider just closes.

EDIT:

OK, now I have this:

class ExpireddomainsSpider(InitSpider):
    name = "expiredomains"

    def init(self):
        # (Re)load credentials from a random line of users.txt
        with open("users.txt", "r") as users:
            dane = self.random_line(users)
        dane = dane.split(':')
        self.user = dane[0]
        self.password = dane[1]
        self.ip = dane[2]
        self.port = dane[3]
        self.headers = {"User-Agent": dane[4]}

    def __init__(self, typ=None, *args, **kwargs):
        super(ExpireddomainsSpider, self).__init__(*args, **kwargs)
        self.typ = typ
        self.init()

def check_login_response(self, response):
    sprawdz = self.user.title()
    if sprawdz in response.text:
        self.log("Successfully logged in. Let's start crawling!")
        return scrapy.FormRequest('http://expireddomains.net/', callback=self.start_crawl)
    elif "Your account was disabled" in response.text:
        self.log("Your account was disabled!")
        self.init()                  # draw fresh credentials
        return self.login(response)  # and try to log in again
    else:
        self.log("Bad times :(")

But it only works twice: it picks a random line and tries to log in; if that fails, it picks another random line and tries once more; and if that fails too, the spider closes. It never keeps retrying until a login succeeds.

SOLUTION: OK, I've solved it. I needed to add dont_filter=True to the request in my login function:

def login(self, response):
    return FormRequest('https://member.expireddomains.net/login/',
                       formdata={'login': self.user, 'password': self.password},
                       callback=self.check_login_response, method='POST',
                       dont_filter=True)
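(The likely reason this fixes it: Scrapy's scheduler runs every request through a duplicates filter, RFPDupeFilter by default, which silently drops any request whose fingerprint of method, canonicalized URL, and body has already been seen. A repeated, byte-identical login POST, for example when the same random line is drawn twice, is therefore discarded, and dont_filter=True bypasses that check. A small demonstration with Scrapy 1.x's request_fingerprint helper:)

from scrapy.http import FormRequest
from scrapy.utils.request import request_fingerprint

r1 = FormRequest('https://member.expireddomains.net/login/',
                 formdata={'login': 'user', 'password': 'pass'})
r2 = FormRequest('https://member.expireddomains.net/login/',
                 formdata={'login': 'user', 'password': 'pass'})
# Same method + URL + body -> same fingerprint -> the second one is dropped
assert request_fingerprint(r1) == request_fingerprint(r2)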

1 Answer:

Answer 0 (score: 1)

You can move the initialization code out into its own init method and then call self.login again. (Re-running __init__ via super() doesn't help: it only reassigns attributes and never returns a new Request, so the crawl simply runs out of requests and closes.) I would change check_login_response like this:

def check_login_response(self, response):
    sprawdz = self.user.title()
    if sprawdz in response.text:
        self.log("Successfully logged in. Let's start crawling!")
        return scrapy.FormRequest('http://expireddomains.net/', callback=self.start_crawl)
    elif "Your account was disabled" in response.text:
        self.log("Your account was disabled!")
        self.init()                  # draw fresh credentials
        return self.login(response)  # and try to log in again
    else:
        self.log("Bad times :(")
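(One caveat, not part of the original answer: as written, this keeps drawing accounts and retrying forever if every login fails. A sketch that caps the retries, with self.retries as a hypothetical counter attribute:)

def check_login_response(self, response):
    if self.user.title() in response.text:
        self.log("Successfully logged in. Let's start crawling!")
        return scrapy.FormRequest('http://expireddomains.net/', callback=self.start_crawl)
    self.retries = getattr(self, 'retries', 0) + 1
    if self.retries >= 5:  # hypothetical cap, tune as needed
        self.log("Giving up after %d failed logins" % self.retries)
        return None
    self.init()                  # draw a fresh random account
    return self.login(response)  # login() uses dont_filter=True, so it is rescheduled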