Attempting a web scrape: try/except inside a for loop

Date: 2019-01-02 05:42:36

Tags: python-3.x pandas web-scraping beautifulsoup

I have written the code below, attempting to web-scrape using Python, Pandas, etc. Overall, there are four steps I follow to produce the desired output:

  1. Get a list of names to append to a base URL
  2. Create a list of player-specific URLs
  3. Use the player URLs to scrape tables
  4. Add the player's name to each table I scrape, to track which player goes with which stats: in every row of a table, add a column containing the name of the player used to scrape that table

I was able to get #1 and #2 working. The components of #3 seem to work, but I think something is wrong with my try/except, because if I run a single line of code to scrape one specific playerUrl, the table DataFrame populates as expected. The first player scraped has no data, so I think I am failing to catch the error properly.

For #4, I haven't found a solution yet. How can I add the name to the table while iterating in the for loop?
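For #4, one common pattern (shown here as a toy sketch with made-up tables and stats, not the live MLS pages) is to add a column to each table inside the loop, right before appending it:

```python
import pandas as pd

# Stand-ins for scraped tables; in the real script these would come from pd.read_html
scraped = {
    "zlatan-ibrahimovic": pd.DataFrame({"GP": [27], "G": [22]}),
    "carlos-vela": pd.DataFrame({"GP": [28], "G": [14]}),
}

tbls = []
for name, table in scraped.items():
    table = table.copy()
    table["Player"] = name  # every row now records which player it belongs to
    tbls.append(table)

combined = pd.concat(tbls, ignore_index=True)
```

Because the column assignment happens inside the loop, each table is tagged before it is appended, so the provenance survives the final concatenation.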

Thanks for your help.

import requests
import pandas as pd
from bs4 import BeautifulSoup



### get the player data to create player specific urls

res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content,'html.parser')
data = soup.find('div', class_ = 'item-list' )

names=[]


for player in data:
    name = data.find_all('div', class_ = 'name')
    for obj in name:
        names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ','-'))


### create a list of player specific urls
url = 'https://www.mlssoccer.com/players/'
playerUrl = []
x = 0
for name in (names):
    playerList = names
    newUrl = url + str(playerList[x])
    print("Gathering url..."+newUrl)
    playerUrl.append(newUrl)
    x +=1

### now take the list of urls and gather stats tables

tbls = []
i = 0
for url in (playerUrl):
    try:                                                        ### added the try, except, pass because some players have no stats table
        tables = pd.read_html(playerUrl[i], header = 0)[2]
        tbls.append(tables)
        i +=1
    except Exception:
        continue
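A side note on the loop above: the counter `i` only advances inside the try block, so after the first failed URL every later iteration re-reads the same index while the loop variable `url` goes unused. A toy reproduction, with a stand-in `parse` function in place of pd.read_html:

```python
# `parse` stands in for pd.read_html: it raises for one "bad" input
def parse(u):
    if u == "bad":
        raise ValueError("no stats table")
    return u.upper()

urls = ["a", "bad", "b"]

# Buggy pattern (mirrors the question): manual index, incremented inside try
results_buggy, i = [], 0
for u in urls:
    try:
        results_buggy.append(parse(urls[i]))
        i += 1
    except ValueError:
        continue
# results_buggy == ["A"]: once i sticks at the failing index, "b" is never tried

# Fixed pattern: use the loop variable directly, no manual counter
results_fixed = []
for u in urls:
    try:
        results_fixed.append(parse(u))
    except ValueError:
        continue
# results_fixed == ["A", "B"]
```

Using the loop variable directly removes the counter entirely, so one failing page can no longer desynchronize the rest of the run.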

2 answers:

Answer 0 (score: 1)

There is a lot of redundancy in your script; you can clean it up as described below. First, I used select() instead of find_all() to cut down on verbosity. To get rid of the IndexError, you can use the continue keyword, as shown:

import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = "https://www.mlssoccer.com/players?page=0"
url = 'https://www.mlssoccer.com/players/'

res = requests.get(base_url)
soup = BeautifulSoup(res.text,'lxml')
names = []
for player in soup.select('.item-list .name a'):
    names.append(player.get_text(strip=True).replace(" ","-"))

playerUrl = {}
for name in names:
    playerUrl[name] = f'{url}{name}'

tbls = []
for url in playerUrl.values():
    if len(pd.read_html(url)) <= 2:
        continue
    tables = pd.read_html(url, header=0)[2]
    tbls.append(tables)

print(tbls)
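One refinement to the loop above: pd.read_html is called twice per page (once to count the tables, once to extract the third one), which doubles the HTTP requests. A sketch that parses each page only once; the fetch function is injectable here so the logic can be exercised without the live site, whose three-table layout is an assumption:

```python
import pandas as pd

def gather_tables(urls, fetch=pd.read_html):
    """Collect the third table (index 2) from each page, skipping pages
    that have fewer than three tables."""
    tbls = []
    for u in urls:
        tables = fetch(u, header=0)  # parse the page once and reuse the result
        if len(tables) <= 2:
            continue
        tbls.append(tables[2])
    return tbls
```

With the default `fetch=pd.read_html` this behaves like the answer's loop, but each URL is fetched and parsed a single time.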

Answer 1 (score: 0)

There are a few things you can do to improve the code and complete steps 3 and 4:

  1. When looping with for name in names, there is no need to use an index explicitly; just use the loop variable name.
  2. You can store the players' names and their corresponding URLs in a dictionary, with the name as the key. The name can then be used in steps 3 and 4.
  3. Construct a DataFrame from each parsed HTML table, then add the player's name to it as a column. Keep each of these DataFrames.
  4. Finally, concatenate these DataFrames into a single frame.

Here is your code with the changes suggested above:

import requests
import pandas as pd
from bs4 import BeautifulSoup



### get the player data to create player specific urls

res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content,'html.parser')
data = soup.find('div', class_ = 'item-list' )

names=[]

for player in data:
    name = data.find_all('div', class_ = 'name')
    for obj in name:
        names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ','-'))


### create a list of player specific urls
url = 'https://www.mlssoccer.com/players/'
playerUrl = {}
for name in names:
    newUrl = url + str(name)
    print("Gathering url..."+newUrl)
    playerUrl[name] = newUrl

### now take the list of urls and gather stats tables

tbls = []
for name, url in playerUrl.items():
    try:
        tables = pd.read_html(url, header = 0)[2]
        df = pd.DataFrame(tables)
        df['Player'] = name
        tbls.append(df)
    except Exception as e:
        print(e)
        continue

result = pd.concat(tbls)
print(result.head())
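One last detail worth noting about the final pd.concat: each per-player frame keeps its own 0-based row index, so the combined frame ends up with duplicate index labels. Passing ignore_index=True yields a fresh 0..n-1 index, as this small example shows:

```python
import pandas as pd

a = pd.DataFrame({"Goals": [5]})
b = pd.DataFrame({"Goals": [3]})

stacked = pd.concat([a, b])                       # index is 0, 0 (duplicated)
reindexed = pd.concat([a, b], ignore_index=True)  # index is 0, 1
```

Duplicate labels are harmless for printing, but they can cause surprises later with .loc lookups, so resetting the index at concat time is usually the safer default here.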