使用python和漂亮的汤提取多个表数据

时间:2018-10-07 00:44:42

标签: python beautifulsoup

<div class="row margin_30">
<div class="col-md-12 col-sm-12 col-xs-12 col-lg-12">
<div class="table-responsive table-border-radius">
<table class="table  table-hover result-table-new1 " style="margin:0">
<thead class="">
<tr class="">
<th style="text-align:center;">Pl</th>
<th>H.No</th>
<th>Horse/Pedigree</th>
<th>Desc</th>
<th>Trainer</th>
<th>Jockey</th>
<th>Wt</th>
<th>Al</th>
<th>Dr</th>
<th>Sh</th>
<th>Won By</th>
<th>Dist Win</th>
<th>Rtg</th>
<th>Odds</th>
<th>Time</th>
</tr>
</thead>
<tbody class="">
<tr class="dividend_tr"  >
<td>1 </td>
<td style="text-align: center;">7 </td>
<td class="race_card_td"><h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55234/SILKEN 
 STRIKER">
 SILKEN STRIKER </a></h5>
<h6 class="margin_remove">Sussex(GB)-Flying Rani </h6>
</td>
<td>
4y b g </td>
<td>
Irfan Ghatala </td>
<td>
 Anjar Alam </td>
<td>
 56 </td>
<td>
 - </td>
<td>
 6 </td>
 <td>
 A </td>
<td>
5 1/2 </td>
<td>
</td>
<td>
12 </td>
<td>
</td>
<td>
1:14.57 </td>
</tr>
<tr class="dividend_tr"  >
<td>
2 </td>
<td style="text-align: center;">
5 </td>
<td class="race_card_td">
<h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55737/ULTIMATE 
POWER">
ULTIMATE POWER </a>
</h5>
<h6 class="margin_remove">
Epicentre(USA)-Methodical </h6>
</td>
<td>
4y b g </td>
<td>
V Lokanath </td>
<td>
Darshan R N </td>
<td>
57 </td>
<td>
-1 </td>
<td>
3 </td>
<td>
A </td>
<td>
5 </td>
<td>
5.5 </td>
<td>
14 </td>
<td>
</td>
<td>
1:15.47 </td>
</tr>
</tbody>
</table>
</div>

我希望使用Beautiful汤添加以下输出并将其存储在csv文件中。实际页面[http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS]具有多个表和许多行。另外,我需要编写一个函数以从不同页面获取数据。

[结果] [1]

 [1]: https://i.stack.imgur.com/4LYt8.jpg

任何帮助都会很棒。

1 个答案:

答案 0 :(得分:0)

这很简单,您需要找到所有表,然后根据需要迭代tr和td。您可以使用熊猫来保存抓取的数据。我已经为您解析了表(其余的工作要做)...检查下面的代码。

import requests
from bs4 import BeautifulSoup

url = 'http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS'
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')
table = soup.find_all('table', attrs={
    'class':'result-table-new1'})
for i in table:
    tr = i.find_all('tr')
    for td in tr:
        print(td.text.replace('\n', ' '))
相关问题