I've gone through some tutorials and read a book on the basics of BeautifulSoup, and I wrote this scraper, but I can't get it to loop through the URLs a-z or page through the results. For this project I'm scraping one site, and I'd like it to scrape pages A-Z rather than just the results on the A page.
The code below worked until I tried to have it generate the lettered URL strings -
Here is my working code - I'm trying to build the URL strings. Ideally I'd also like to pull the letters from a file or a predefined list, but small steps.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.findAll("td"):
            playerdata = playerdata + "," + data.text
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))

print(letter)
print(playerdatasaved)
The error I get is below ---------------------
Traceback (most recent call last):
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 15, in <module>
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 8, in make_soup
    thepage = urllib.request.urlopen(url)
  File "C:\Python36\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 564, in error
    result = self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 756, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Can anyone give me some help or advice?
Below is a working version for a single page - I need it to scrape multiple pages.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdatasaved = ""
soup = make_soup("http://www.basketball-reference.com/players/a/")
for record in soup.find_all("tr"):
    playerdata = ""
    for data in record.findAll("td"):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College" + "\n"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))
print(playerdatasaved)
Answer (score: 0):
That particular site has no "x" page, so you get a 404 for that letter. Try wrapping the request in a try/except so it skips the 404 pages, and it should work.
playerdatasaved=""
for letter in ascii_lowercase:
try:
soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
for record in soup.find_all("tr"):
playerdata=""
for data in record.findAll("td"):
playerdata=playerdata+","+data.text
if len(playerdata)!=0:
playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
except:
pass
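For completeness, here is a minimal end-to-end sketch under the same assumptions as the code above (same page structure, same CSV columns). It narrows the try/except to the network call so that only the 404s are skipped, writes the CSV once after the loop finishes, and, as a hypothetical extra for the "pull from a file or a predefined list" part of the question, reads the letters from a letters.txt file (one letter per line, a made-up filename) when such a file exists:

import os
import urllib.error
import urllib.request
from string import ascii_lowercase

from bs4 import BeautifulSoup


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    return BeautifulSoup(thepage, "html.parser")


# Hypothetical: read letters from a file (one per line) if it exists,
# otherwise fall back to the full a-z range.
if os.path.exists("letters.txt"):
    with open("letters.txt") as f:
        letters = [line.strip() for line in f if line.strip()]
else:
    letters = ascii_lowercase

playerdatasaved = ""
for letter in letters:
    try:
        soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
    except urllib.error.HTTPError:
        continue  # no page for this letter (404), move on
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.find_all("td"):
            playerdata = playerdata + "," + data.text
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

# Write everything once, after all letters have been scraped.
header = "Player,From,To,Pos,Ht,Wt,Birth Date,College"
with open(os.path.expanduser("Basketball.csv"), "wb") as file:
    file.write(bytes(header, encoding="ascii", errors="ignore"))
    file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))
print(playerdatasaved)

Keeping the try/except around just the urlopen call means a genuine parsing error will still surface instead of being silently swallowed along with the missing-page 404s.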