Crawling non-Latin domains with Scrapy

Time: 2016-01-01 16:23:13

Tags: python url scrapy httprequest scrapy-spider

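I need to crawl some websites in the ".рф" domain zone with Scrapy. The URLs look like "http://сайтдляпримера.рф" (this URL is not real; it is given only as an example). The sites I am trying to crawl are, of course, reachable from a browser. I tried to start the crawl with the start_urls attribute, for example: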

start_urls = ['http://сайтдляпримера.рф']

and also with the start_requests method:

def start_requests(self):
    return [scrapy.Request("http://сайтдляпримера.рф/", callback=self._test)]

Neither of them worked as expected, and I got the following console output:

2016-01-01 19:02:01 [scrapy] INFO: Spider opened
2016-01-01 19:02:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-01 19:02:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 1 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 2 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] DEBUG: Gave up retrying <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84> (failed 3 times): DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] ERROR: Error downloading <GET http://%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84>: DNS lookup failed: address '%D1%81%D0%B0%D0%B9%D1%82%D0%B4%D0%BB%D1%8F%D0%BF%D1%80%D0%B8%D0%BC%D0%B5%D1%80%D0%B0.%D1%80%D1%84' not found: [Errno -2] Name or service not known.
2016-01-01 19:02:01 [scrapy] INFO: Closing spider (finished)

*In case it matters, I need to run Scrapy on a Linux-based operating system.

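Is there a solution? If possible, I would like to solve this from within the _spider file, because I do not have access to the framework's repository (nothing that handles HTTP requests has been modified there).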

1 answer:

Answer 0 (score: 2)

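When working with internationalized domain names (IDN), you need to encode the non-ASCII part of the URL with idna, and then decode the resulting bytes back into a unicode string. Also note that the ASCII substring of the URL that makes up the protocol name ('http://') should be prepended separately, so that it is not mangled by the idna encoding: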

'http://' + u'сайтдляпримера.рф'.encode('idna').decode('utf-8')
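To see why the DNS lookup in the question fails, it may help to compare the two encodings side by side. The sketch below assumes Python 3 and uses only the top-level label "рф"; the variable names are made up for illustration. The percent-encoded form is what appears in the log, while DNS expects the IDNA (Punycode) form.

```python
from urllib.parse import quote

host = 'рф'  # the top-level label from the question

# Percent-encoding of the UTF-8 bytes: this is the form seen in the log,
# and it is not a valid DNS name.
percent_encoded = quote(host)

# IDNA encoding yields ASCII bytes; decode them to get a str again.
idna_encoded = host.encode('idna').decode('ascii')

print(percent_encoded)  # %D1%80%D1%84
print(idna_encoded)     # xn--p1ai
```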

See also this document for more details.
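For URLs that carry a path, port, or query string, the same idea can be wrapped in a small helper that encodes only the hostname. This is a minimal sketch assuming Python 3; to_idna_url is a hypothetical name, not part of Scrapy, and IPv6 literal hosts are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def to_idna_url(url):
    """Return url with only its hostname converted to IDNA (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return url  # nothing to encode, e.g. a relative URL

    # Encode the hostname alone, so the scheme, path and query stay untouched.
    netloc = host.encode('idna').decode('ascii')

    # Re-attach the port and user info that urlsplit separated out.
    if parts.port is not None:
        netloc = '{}:{}'.format(netloc, parts.port)
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = '{}:{}'.format(userinfo, parts.password)
        netloc = '{}@{}'.format(userinfo, netloc)

    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))


print(to_idna_url('http://сайтдляпримера.рф/'))
```

In a spider, each URL could be passed through such a helper before it is put into start_urls or handed to scrapy.Request, which keeps the fix inside the spider file as the question asks.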