Extracting data from a table with Python Beautiful Soup

Time: 2017-02-20 19:58:49

Tags: python parsing beautifulsoup

I am trying to parse the rows of a table (a departure board with times) from the following response:

buscms_widget_departureboard_ui_displayStop_Callback("
 <div class='\"livetimes\"'>
 <table class='\"busexpress-clientwidgets-departures-departureboard\"'>
  <thead>
   <tr class='\"rowStopName\"'>
    <th colspan='\"3\"' data-bearing='\"SW\"' data-lat='\"51.7505683898926\"' data-lng='\"-1.225102186203\"' title='\"oxfajmwg\"'>
     Divinity Road
    </th>
    <tr>
     <tr class='\"textHeader\"'>
      <th colspan='\"3\"'>
       text 69325694 to 84637 for live times
      </th>
      <tr>
       <tr class='\"rowHeaders\"'>
        <th>
         service
        </th>
        <th>
         destination
        </th>
        <th>
         time
        </th>
        <tr>
        </tr>
       </tr>
      </tr>
     </tr>
    </tr>
   </tr>
  </thead>
  <tbody>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 21:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"5'>
     5 mins
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 22:11:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"27'>
     27 mins
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4  (OBC)
    </td>
    <td class='\"colDestination\"' title='\"Abingdon\"'>
     Abingdon
    </td>
    <td 22:29:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"22:29\"'>
     22:29
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 22:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"65'>
     65 mins
    </td>
   </tr>
   <tr class='\"rowServiceDeparture\"'>
    <td class='\"colServiceName\"'>
     4A  (OBC)
    </td>
    <td class='\"colDestination\"' rise\"="" title='\"Elms'>
     Elms Rise
    </td>
    <td 23:09:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"23:09\"'>
     23:09
    </td>
   </tr>
  </tbody>
 </table>
</div>
<div class='\"scrollmessage_container\"'>
 <div class='\"scrollmessage\"'>
 </div>
</div>
<div class='\"services\"'>
 <a class='\"service' href='\"#\"' onclick="\&quot;serviceNameClick('');\&quot;" selected\"="">
  all
 </a>
 <a class='\"service\"' href='\"#\"' onclick="\&quot;serviceNameClick('4');\&quot;">
  4
 </a>
</div>
<div class="dptime">
 <span>
  times generated at:
 </span>
 <span>
  21:43
 </span>
</div>
");

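Specifically, I want to extract every departure time, that is, the time remaining until each departure (for example, "12 mins").

I have the following code: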

# import libraries
import urllib.request
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'

# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page) 

# parse the HTML using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(page, 'html.parser')  

print(soup.prettify())

I'm not sure how to get the minutes until departure out of the markup above. Is it something like this?

minutes_from_depart = soup.find("tbody", attrs={'td': 'mins'}) 

2 answers:

Answer 0: (score: 1)

Could you try this?

import urllib.request
from bs4 import BeautifulSoup
import re

quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'

page = urllib.request.urlopen(quote_page).read()

soup = BeautifulSoup(page, 'lxml')  

print(soup.prettify())

minutes = soup.find_all("td", class_=re.compile(r"colDepartureTime"))

for cell in minutes:
    print(cell.get_text())
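
A regular expression is used for the class here presumably because the response is a JSONP string whose quotes arrive escaped as \", so the parser sees the class value as \"colDepartureTime\" rather than colDepartureTime. The following is a minimal offline sketch of that effect, not the actual service response: the `callback(...)` wrapper and the shortened markup are hypothetical stand-ins, and it assumes the escaping is always a backslash followed by a double quote. It also shows one possible alternative, which is to remove the backslashes before parsing so that plain class names and attributes such as `title` work again.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the response body.
# The raw strings keep the literal \" sequences seen in the question.
raw = (
    r'callback("<table><tr>'
    r'<td class=\"colDepartureTime\" title=\"5 mins\">5 mins</td>'
    r'<td class=\"colDepartureTime\" title=\"22:29\">22:29</td>'
    r'</tr></table>");'
)

# Parsed as-is, the class value still contains the backslashes and quotes.
escaped = BeautifulSoup(raw, "html.parser")
plain = escaped.find_all("td", class_="colDepartureTime")
by_regex = [
    td.get_text(strip=True)
    for td in escaped.find_all("td", class_=re.compile("colDepartureTime"))
]

# Assumption: dropping the backslash in front of each quote restores normal HTML.
clean = BeautifulSoup(raw.replace('\\"', '"'), "html.parser")
titles = [td["title"] for td in clean.find_all("td", class_="colDepartureTime")]

print(plain)     # exact class match finds nothing in the escaped markup
print(by_regex)  # substring match via the regex finds both cells
print(titles)    # after unescaping, attributes are intact as well
```

If the unescaping step fits the real response, the exact class name can be used and attributes such as `data-departuretime` should no longer be split at the space inside their values.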

Answer 1: (score: 1)

I got it working with the following code. Once I used the soup.find_all function, it turned out to be quite simple:

import urllib.request
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'

# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page) 

# parse the HTML using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(page, 'html.parser')  

for cell in soup.find_all('td', class_='\\"colDepartureTime\\"'):
    print(cell.get_text())

I get the following output:

10:40
10 mins
21 mins
30 mins
40 mins
50 mins
60 mins
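
The output mixes clock times such as "10:40" with relative labels such as "10 mins". If a uniform "minutes until departure" value is wanted, one possible approach is sketched below using only the standard library. The helper name `minutes_until` is hypothetical, the sample labels and the 21:43 generation time are taken from the board shown in the question, and the sketch assumes that a clock time earlier than the current time refers to the next day. Any other label format is rejected rather than guessed at.

```python
import re
from datetime import datetime, timedelta


def minutes_until(label, now):
    """Convert a label such as '5 mins' or '22:29' to whole minutes from `now`."""
    label = label.strip()

    # Relative labels already carry the answer.
    relative = re.fullmatch(r"(\d+)\s*mins?", label)
    if relative:
        return int(relative.group(1))

    # Clock labels are compared against the reference time.
    clock = re.fullmatch(r"(\d{1,2}):(\d{2})", label)
    if clock:
        departure = now.replace(
            hour=int(clock.group(1)),
            minute=int(clock.group(2)),
            second=0,
            microsecond=0,
        )
        # Assumption: a time earlier than `now` means tomorrow.
        if departure < now:
            departure += timedelta(days=1)
        return int((departure - now).total_seconds() // 60)

    raise ValueError(f"unrecognised departure label: {label!r}")


# Reference time from the "times generated at" field in the question.
generated_at = datetime(2017, 2, 20, 21, 43)
labels = ["5 mins", "27 mins", "22:29", "65 mins", "23:09"]
minutes = [minutes_until(label, generated_at) for label in labels]
print(minutes)
```

In a live script, the labels would come from the `get_text()` calls above (stripped of whitespace), and the reference time could be read from the "times generated at" field instead of being hard-coded.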