How do I generate a URL string in BeautifulSoup?

Date: 2017-02-13 22:32:29

Tags: python beautifulsoup

I've gone through some tutorials and read a book on the basics of BeautifulSoup, and I wrote this scraper, but I can't get it to loop through the URLs a-z or page through the results. For this project I'm scraping a website, and I want it to scrape the results for letters A-Z rather than just the "a" page.

The code below worked until I tried to get it to generate the letter string at the end of the URL -

Here is my working code with my attempt at building the URL string. Ideally I'd also like to pull the letters from a file or a predefined list, but small steps.

import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.findAll("td"):
            playerdata = playerdata + "," + data.text
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))

print(letter)
print(playerdatasaved)

My error is below:

Traceback (most recent call last):
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 15, in <module>
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 8, in make_soup
    thepage = urllib.request.urlopen(url)
  File "C:\Python36\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 564, in error
    result = self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 756, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Can anyone give me some help or advice?

Below is the working version for a single page - I need it to scrape multiple pages.

import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdatasaved = ""
soup = make_soup("http://www.basketball-reference.com/players/a/")
for record in soup.find_all("tr"):
    playerdata = ""
    for data in record.findAll("td"):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College" + "\n"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))


print(playerdatasaved)

1 answer:

Answer 0 (score: 0):

That particular site has no "x" page, so you get a 404 for that letter. Try wrapping the request in a try/except so it skips the 404 pages, and it should work.

playerdatasaved = ""
for letter in ascii_lowercase:
    try:
        soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
        for record in soup.find_all("tr"):
            playerdata = ""
            for data in record.findAll("td"):
                playerdata = playerdata + "," + data.text
            if len(playerdata) != 0:
                playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
    except:
        pass  # the page for this letter does not exist; skip the 404 and move on
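
A note on the bare except: it also swallows unrelated problems, such as network failures or a typo anywhere inside the loop body. A slightly safer variant, sketched below under the assumption that make_soup from the question is in scope, catches only the urllib.error.HTTPError that urlopen raises for the 404:

import urllib.error
from string import ascii_lowercase

playerdatasaved = ""
for letter in ascii_lowercase:
    url = "http://www.basketball-reference.com/players/" + letter + "/"
    try:
        soup = make_soup(url)
    except urllib.error.HTTPError:
        continue  # no players for this letter (e.g. "x"), so the page 404s; skip it
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.findAll("td"):
            playerdata = playerdata + "," + data.text
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]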
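
The question also mentions eventually pulling the letters from a file or a predefined list instead of string.ascii_lowercase. That only changes what the loop iterates over; here is a minimal sketch ("letters.txt" is a hypothetical example file with one entry per line):

# Option 1: a predefined list of letters to scrape
letters = ["a", "b", "c"]

# Option 2: read them from a text file instead (uncomment to use);
# "letters.txt" is a hypothetical file with one letter per line
# with open("letters.txt") as f:
#     letters = [line.strip() for line in f if line.strip()]

for letter in letters:
    url = "http://www.basketball-reference.com/players/" + letter + "/"
    # ... same try/except scraping loop as above ...
    print(url)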