Question

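I am using Scrapy to crawl several websites. My spider is not allowed to cross domains. In this setup, a redirect makes the crawler stop immediately. In most cases I know how to handle that, but this one is strange.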

The culprit is: http://www.cantonsd.org/

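I checked its redirect pattern with http://www.wheregoes.com/ and it tells me the site redirects to "/". This prevents the spider from ever entering its parse function. How can I handle this?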
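For reference, a redirect target of "/" is a relative reference, and relative references are resolved against the URL that was requested. The sketch below shows that resolution using only the Python standard library. The helper name `resolve_redirect` is hypothetical, and this is an illustration of URL resolution, not Scrapy's own redirect handling.

```python
from urllib.parse import urljoin, urlparse


def resolve_redirect(request_url, location):
    """Resolve a (possibly relative) redirect target against the request URL."""
    return urljoin(request_url, location)


# Hypothetical check for the site in question.
target = resolve_redirect("http://www.cantonsd.org/", "/")
print(target)
print(urlparse(target).hostname)
```

If the redirect really is a plain "/", the resolved target is the root of the same host rather than a different domain.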

EDIT: The code.

I start the spider through the API that Scrapy provides: http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script The only difference is that my spider is a custom one. It is created as follows:

import re

# start_url, allowed_domain, allowed_path, url_id, cur_state, id_url and
# state_abbrs are defined elsewhere in the calling script.
spider = DomainSimpleSpider(
    start_urls=[start_url],
    allowed_domains=[allowed_domain],
    url_id=url_id,
    cur_state=cur_state,
    state_id_url_map=id_url,
    # Only follow links whose URL contains allowed_path (case-insensitive).
    allow=re.compile(r".*%s.*" % re.escape(allowed_path), re.IGNORECASE),
    tags=('a', 'area', 'frame'),
    attrs=('href', 'src'),
    response_type_whitelist=[r"text/html", r"application/xhtml+xml", r"application/xml"],
    state_abbr=state_abbrs[cur_state],
)

I think the problem is that the allowed_domains check sees that "/" is not in the list (which contains only cantonsd.org) and shuts everything down.
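One way to test that guess outside the crawler is to check by hand whether a given URL's host would pass a domain whitelist. The sketch below is a simplified suffix match on host names. It is an approximation, not Scrapy's actual offsite filtering code, and `host_is_allowed` is a hypothetical helper.

```python
from urllib.parse import urlparse


def host_is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


# Hypothetical check using the values from the question.
print(host_is_allowed("http://www.cantonsd.org/", ["cantonsd.org"]))
```

Under this approximation, an absolute URL on www.cantonsd.org passes a whitelist containing cantonsd.org, so it is worth confirming in the crawl log which URL the filter actually receives.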

I have not posted the full spider code because it is never called at all, so it cannot be the cause.

How to crawl a website that redirects to "/"

0 answers: