我正在从某个https网页下载一些数据 https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html, 所以我因HTTPS而收到此错误。当我手动将网页更改为HTTP时,它可以正常下载。我正在寻找类似的例子来解决这个问题,但我没有发现任何问题。你知道该怎么做吗?
Traceback (most recent call last):
File "down.py", line 34, in <module>
soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")
File "g:\python\Lib\urllib.py", line 87, in urlopen
return opener.open(url)
File "g:\python\Lib\urllib.py", line 213, in open
return getattr(self, name)(url)
File "g:\python\Lib\urllib.py", line 443, in open_https
h.endheaders(data)
File "g:\python\Lib\httplib.py", line 1049, in endheaders
self._send_output(message_body)
File "g:\python\Lib\httplib.py", line 893, in _send_output
self.send(msg)
File "g:\python\Lib\httplib.py", line 855, in send
self.connect()
File "g:\python\Lib\httplib.py", line 1274, in connect
server_hostname=server_hostname)
File "g:\python\Lib\ssl.py", line 352, in wrap_socket
_context=self)
File "g:\python\Lib\ssl.py", line 579, in __init__
self.do_handshake()
File "g:\python\Lib\ssl.py", line 808, in do_handshake
self._sslobj.do_handshake()
IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:5
90)
这是我的计划:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
#
# DOWNLOADER
# To grab the text content of webpages and save it to TinyDB database.
import re, time, urllib, tinydb
from bs4 import BeautifulSoup
start_time = time.time()
#Open file with urls.
with open("G:/myVE/vacancies/urls2.csv") as f:
lines = f.readlines()
#Open file to write HTML to.
with open("G:/myVE/downloader/urls2_html.txt", 'wb') as g:
#We parse the content of url file to get just urls without the first line and without the text.
for line in lines[1:len(lines)]:
#Read the url from the file.
#url = line.split(",")[0]
url = line
print "test"
#Read the HTML of url
soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")
print url
#Mark of new HTML in HTML file.
g.write("\n\nNEW HTML\n\n")
#Write new HTML to file.
g.write(str(soup))
print "Html saved to html.txt"
print "--- %s seconds ---" % round((time.time() - start_time),2)
"""
#We read HTML of the employment webpage that we intend to parse.
soup = BeautifulSoup(urllib.urlopen('http://www.simplybusiness.co.uk/about-us/careers/jobs/').read(), "html.parser")
#We write HTML to a file.
with open("E:/analitika/SURS/tutorial/tutorial/html.txt", 'wb') as f:
f.write(str(soup))
print "Html saved to html.txt"
print "--- %s seconds ---" % round((time.time() - start_time),2)
"""
谢谢!
答案 0 :(得分:1)
您应该使用requests
库,请参阅http://docs.python-requests.org/en/latest/user/advanced/#ssl-cert-verification作为参考。
已更新以添加
现在,您的网址是requests
库的示例。
import requests
url = "https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html"
r = requests.get(url, verify=True)
print(r.text)
以下是beautifulsoup
和Python 3.3的示例,它似乎也有效。
import urllib
from bs4 import BeautifulSoup
url = "https://www.spar.si/sl_SI/zaposlitev/prosta-delovna-mesta-.html"
soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(soup)