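Newbie trying to scrape data and break it apart

Time: 2017-10-31 13:31:23

Tags: python

I was able to scrape some data from a website, but I can't break it apart to display it in a table.

The code I'm using is: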

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
tablesright = soup.find_all('td', 'right')
tablesleft = soup.find_all('td', 'left')
print(tablesright + tablesleft)

This gives me a result like this:

====================== RESTART: E:/2017/Python2/box2.py   ======================
[<td class="right " data-stat="game_start_time">8:01 pm</td>, <td class="right " data-stat="visitor_pts">99</td>, <td class="right " data-stat="home_pts">102</td>, <td class="right " data-stat="game_start_time">10:30 pm</td>, <td class="right " data-stat="visitor_pts">122</td>, <td class="right " data-stat="home_pts">121</td>, <td class="right " data-stat="game_start_time">7:30 pm</td>, <td class="right " data-stat="visitor_pts">108</td>, <td class="right " data-stat="home_pts">100</td>, <td class="right " data-stat="game_start_time">8:30 pm</td>, <td class="right " data-stat="visitor_pts">117</td>, <td class="right " data-stat="home_pts">111</td>, <td class="right " data-stat="game_start_time">7:00 pm</td>, <td class="right " data-stat="visitor_pts">90</td>, <td class="right " data-stat="home_pts">102</td>, <

And the left part:

<td class="left " csk="BOS.201710170CLE" data-stat="visitor_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>, <td class="left " csk="CLE.201710170CLE" data-stat="home_team_name"><a href="/teams/CLE/2018.html">Cleveland Cavaliers</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="HOU.201710170GSW" data-stat="visitor_team_name"><a href="/teams/HOU/2018.html">Houston Rockets</a></td>, <td class="left " csk="GSW.201710170GSW" data-stat="home_team_name"><a href="/teams/GSW/2018.html">Golden State Warriors</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="MIL.201710180BOS" data-stat="visitor_team_name"><a href="/teams/MIL/2018.html">Milwaukee Bucks</a></td>, <td class="left " csk="BOS.201710180BOS" data-stat="home_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>, <td class="left " data-stat="game_remarks"></td>, <td class="left " csk="ATL.201710180DAL" data-

OK, now I can't figure out how to break up the results so that I get a nice table like this:

Game start time    Home team.     Score.   Away team.    Score
7pm.               Boston.        104.     Golden state.  103

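I'm pulling my hair out trying to figure this out.

Thanks in advance.

4 answers:

Answer 0: (score: 1)

Instead of using an HTML parser, you could try reading the page into a pandas DataFrame and then decide how to manipulate that DataFrame to display the results you need.

Example: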

import pandas as pd


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
dfs = pd.read_html(url, match="Start")
print(dfs[0])

There are examples of how to do this in the pandas documentation and in many questions on Stack Overflow. Source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
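As a rough sketch of the "manipulate the DataFrame" step: assuming the table returned by read_html has columns named Start (ET), Visitor/Neutral, PTS, Home/Neutral and PTS.1 (an assumption about the page layout, so check dfs[0].columns on the real data), picking and renaming columns could look like this. A small hand-built frame stands in for dfs[0] so the snippet runs without network access.

```python
import pandas as pd

# Stand-in for dfs[0]; the column names are an assumption about what
# read_html returns for this page (check dfs[0].columns on the real data).
df = pd.DataFrame(
    [
        ["8:01 pm", "Boston Celtics", 99, "Cleveland Cavaliers", 102],
        ["10:30 pm", "Houston Rockets", 122, "Golden State Warriors", 121],
    ],
    columns=["Start (ET)", "Visitor/Neutral", "PTS", "Home/Neutral", "PTS.1"],
)

# Reorder to the layout asked for in the question, then rename the headers.
table = df[["Start (ET)", "Home/Neutral", "PTS.1", "Visitor/Neutral", "PTS"]].copy()
table.columns = ["Game start time", "Home team", "Home score", "Away team", "Away score"]

print(table.to_string(index=False))
```

Passing index=False to to_string drops the row numbers, which gets closer to the plain table in the question.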

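Answer 1: (score: 1)

I don't know whether you want a pandas-based solution. You can also get a formatted table using just the more advanced attrs keyword of find_all and the standard Python format method.

Note that the column widths in format were chosen by hand and do not adapt to the actual data.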

import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
game_start_times = soup.find_all('td', attrs={"data-stat": "game_start_time", "class": "right"})
visitor_team_names = soup.find_all('td', attrs={"data-stat": "visitor_team_name", "class": "left"})
visitor_ptss = soup.find_all('td', attrs={"data-stat": "visitor_pts", "class": "right"})
home_team_names = soup.find_all('td', attrs={"data-stat": "home_team_name", "class": "left"})
home_pts = soup.find_all('td', attrs={"data-stat": "home_pts", "class": "right"})

for i in range(len(game_start_times)):
    print('{:10s} {:28s} {:5s} {:28s} {:5s}'.format(game_start_times[i].text.strip(),
                                  visitor_team_names[i].text.strip(),
                                  visitor_ptss[i].text.strip(),
                                  home_team_names[i].text.strip(),
                                  home_pts[i].text.strip()))

Output:

8:01 pm    Boston Celtics               99    Cleveland Cavaliers          102
10:30 pm   Houston Rockets              122   Golden State Warriors        121
7:30 pm    Milwaukee Bucks              108   Boston Celtics               100
8:30 pm    Atlanta Hawks                117   Dallas Mavericks             111
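
If hand-picked widths are a concern, one option is to size each column from the data itself. This is only a minimal sketch: the rows are hard-coded from the output above, and format_table is a hypothetical helper name, not part of any library.

```python
# Sample rows copied from the output above:
# (start time, visitor, visitor points, home, home points).
rows = [
    ("8:01 pm", "Boston Celtics", "99", "Cleveland Cavaliers", "102"),
    ("10:30 pm", "Houston Rockets", "122", "Golden State Warriors", "121"),
    ("7:30 pm", "Milwaukee Bucks", "108", "Boston Celtics", "100"),
]


def format_table(rows):
    """Left-align every column, sized to its longest cell."""
    # zip(*rows) turns the list of rows into a sequence of columns.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        cells = ["{:<{w}}".format(cell, w=w) for cell, w in zip(row, widths)]
        lines.append(" ".join(cells).rstrip())
    return lines


for line in format_table(rows):
    print(line)
```

With the scraped data, rows could be built by zipping the five find_all results together and taking .text.strip() of each cell.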

Answer 2: (score: 0)

This works. Adjust it to suit your needs, then use pandas.

import requests
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")

rows = soup.select('#schedule > tbody > tr')

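for row in rows:
    rights = row.find_all("td", "right")
    lefts = row.find_all("td", "left")

    # Skip rows that lack the expected cells (for example, header rows).
    if len(rights) < 3 or len(lefts) < 2:
        continue

    print(" ".join([rights[0].text, lefts[0].text, rights[1].text, lefts[1].text, rights[2].text]))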
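
The loop above depends on the order of the right and left cells within a row. A possible variation, sketched here under assumptions, is to key each cell by its data-stat attribute instead. The HTML is a trimmed copy of the markup from the question; the table wrapper with id schedule is inferred from the selector above, and the second row is a hypothetical row without td cells.

```python
from bs4 import BeautifulSoup

# One game row trimmed from the markup in the question, plus a hypothetical
# row that has no <td> cells. The <table id="schedule"> wrapper is assumed.
html = """
<table id="schedule"><tbody>
<tr><td class="right " data-stat="game_start_time">8:01 pm</td>
<td class="left " data-stat="visitor_team_name"><a href="/teams/BOS/2018.html">Boston Celtics</a></td>
<td class="right " data-stat="visitor_pts">99</td>
<td class="left " data-stat="home_team_name"><a href="/teams/CLE/2018.html">Cleveland Cavaliers</a></td>
<td class="right " data-stat="home_pts">102</td></tr>
<tr class="thead"><th>Row without td cells</th></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

games = []
for row in soup.select("#schedule > tbody > tr"):
    # Map each cell's data-stat attribute to its stripped text.
    cells = {
        td["data-stat"]: td.get_text(strip=True)
        for td in row.find_all("td", attrs={"data-stat": True})
    }
    if cells:  # skip rows that contain no matching <td> cells
        games.append(cells)

for game in games:
    print(game["game_start_time"], game["home_team_name"], game["home_pts"],
          game["visitor_team_name"], game["visitor_pts"])
```

A list of dictionaries like games can be passed straight to pandas.DataFrame if a DataFrame is the end goal.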

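Answer 3: (score: 0)

For a structure this simple, I would just drop the libraries and do it with re (regular expressions).

First, a findall to get all the tr tags.

Then a findall inside each tr tag to get all the td / th tags.

Then a sub to filter out any tags inside the fields (mainly the link tags).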

#!/usr/bin/python

import requests
import re

url = 'https://www.basketball-reference.com/leagues/NBA_2018_games.html'
r = requests.get(url)
content = r.text  # decoded text, so the str regex patterns below apply

# One dict per <tr>: data-stat name -> cell text with inner tags stripped.
data = [
    {
        k: re.sub(r'<.+?>', '', v)
        for (k, v) in re.findall(r'<t[dh].+?data-stat="(.*?)".*?>(.*?)</t[dh]', tr)
    }
    for tr in re.findall(r'<tr.+?>(.+?)</tr', content)
]

for game in data:
    print("%s" % game['date_game'])
    for info in game:
        print("  %s = %s" % (info, game[info]))

This gives a nice list of dictionaries (data) that is easy to use for display:

$ ./scores_url.py 
Tue, Oct 17, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Cleveland Cavaliers
  visitor_team_name = Boston Celtics
  game_start_time = 8:01 pm
  date_game = Tue, Oct 17, 2017
  overtimes = 
  visitor_pts = 99
  home_pts = 102
Tue, Oct 17, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Golden State Warriors
  visitor_team_name = Houston Rockets
  game_start_time = 10:30 pm
  date_game = Tue, Oct 17, 2017
  overtimes = 
  visitor_pts = 122
  home_pts = 121
Wed, Oct 18, 2017
  game_remarks = 
  box_score_text = Box Score
  home_team_name = Boston Celtics
  visitor_team_name = Milwaukee Bucks
  game_start_time = 7:30 pm
  date_game = Wed, Oct 18, 2017
  overtimes = 
  visitor_pts = 108
  home_pts = 100
...

Or, in the style of your example:

cols = [
        ['game_start_time',15,"Game start time"],
        ['home_team_name',25,"Home team."],
        ['home_pts',7,"Score."],
        ['visitor_team_name',25,"Away team."],
        ['visitor_pts',7,"Score."]
       ]

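# Header line: right-align each title to its column width.
print(" ".join("%*s" % (col[1], col[2]) for col in cols))

# One line per game, using the same widths.
for game in data:
    print(" ".join("%*s" % (col[1], game[col[0]]) for col in cols))

Something like this: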

Game start time                Home team.  Score.                Away team.  Score.
        8:01 pm       Cleveland Cavaliers     102            Boston Celtics      99
       10:30 pm     Golden State Warriors     121           Houston Rockets     122
        7:30 pm            Boston Celtics     100           Milwaukee Bucks     108
        8:30 pm          Dallas Mavericks     111             Atlanta Hawks     117
        7:00 pm           Detroit Pistons     102         Charlotte Hornets      90
        7:00 pm            Indiana Pacers     140             Brooklyn Nets     131
        8:00 pm         Memphis Grizzlies     103      New Orleans Pelicans      91
    ...
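
To see what each of the three regex steps contributes, here is a small sketch run on a single hard-coded row. The row is trimmed from the markup in the question; the tr wrapper and the th date cell are assumptions about the surrounding page structure.

```python
import re

# One row trimmed from the markup in the question; the <tr > wrapper and the
# <th> date cell are assumptions about the surrounding page structure.
content = (
    '<tr ><th scope="row" data-stat="date_game"><a href="#">Tue, Oct 17, 2017</a></th>'
    '<td class="right " data-stat="game_start_time">8:01 pm</td>'
    '<td class="left " data-stat="visitor_team_name">'
    '<a href="/teams/BOS/2018.html">Boston Celtics</a></td>'
    '<td class="right " data-stat="visitor_pts">99</td></tr>'
)

# Step 1: grab the inside of every <tr ...> element.
rows = re.findall(r'<tr.+?>(.+?)</tr', content)

# Step 2: inside a row, capture (data-stat name, raw cell contents) pairs.
cells = re.findall(r'<t[dh].+?data-stat="(.*?)".*?>(.*?)</t[dh]', rows[0])

# Step 3: strip any remaining tags (here the <a> links) from each value.
game = {name: re.sub(r'<.+?>', '', value) for name, value in cells}

print(cells[2])  # the raw cell still contains the <a> link
print(game)      # the cleaned values
```

Note that these patterns assume each row sits on one line and that every tr tag carries at least one character before its closing bracket; markup that differs from this would need a real HTML parser.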