Getting multiple tags and attribute data with BeautifulSoup

Time: 2014-08-15 13:02:40

Tags: python html parsing beautifulsoup

I want to use BeautifulSoup to get the following tags and attributes from the HTML below:

1) div id="home_1039509"

2) div id="guest_1039509"

3) td id="odds_3_1039509"

4) div id="gs_1039509"

5) div id="hs_1039509"

6) div id="time_1039509"

HTML:

  <tr align="center" height="15" id="tr_1039509" bgcolor="#F7F3F7" index="0">
    <td width="10">
       <img src="images/lclose.gif" onclick="hidematch(0)" style="cursor:pointer;">
    </td>
  <td width="63" bgcolor="#d15023">
    <font color="#ffffff">U18<br>
       <span id="t_1039509">14:05</span>
    </font>
  </td>
  <td width="115" style="text-align:left;">
  <div id="home_1039509">
       <a href="javascript:Team(19195)">U18()</a>
  </div>
  <div class="oddsAns"> 
       &nbsp;[
  <a href="javascript:AsianOdds('1039509')">A</a>
   -
  <a href="javascript:EuropeOdds(1039509)" target="_self">B</a>
   -
  </div>
 <div id="guest_1039509">
  <a href="javascript:Team(11013)">U18</a>
 </div>
 </td>
 <td width="30">
     <div id="gs_1039509" class="score">2</div>
 <div id="time_1039509">
     42
     <img src="images/in.gif" border="0">
 </div>
 <div id="hs_1039509" class="score">1</div></td>
 <td width="90" id="odds_1_1039509" title=""></td>
 <td width="90" id="odds_4_1039509" title=""></td>
 <td width="90" id="odds_3_1039509" title="">
     <a class="sb" href="javascript:" onclick="ChangeDetail3(1039509,'3')">0.94</a>                            
 <img src="images/t3.gif">
   <br>
     <a class="pk" href="javascript:" onclick="ChangeDetail3(1039509,'3')">2.5/3</a>            
   <br>
     0.86
 </td>
 <td width="90" id="odds_31_1039509" title="nothing"></td>
    </tr>

Code:

rows = table.findAll("tr", {"id" : re.compile('tr_*\d')})

for tr in rows:
    cols = tr.findAll("span", {"id" : re.compile('t_*\d')}) &
    cols = tr.findAll("div", {"id" : re.compile('home_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('guest_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('guest_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('odds_3_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('hs_*\d')})

for td in cols:
    t = td.find(text=True)
    if t:
        text = t + ';' # concat
    print text,
print

2 answers:

Answer 0 (score: 3)

You can pass a function and check whether the id starts with home_, guest_, etc.:

from bs4 import BeautifulSoup

f = lambda x: x and x.startswith(('home_', 'guest_', 'odds_', 'gs_', 'hs_', 'time_'))

soup = BeautifulSoup(open('test.html'))
print [element.get_text(strip=True) for element in soup.find_all(id=f)]

Prints:

[u'U18()', u'U18', u'2', u'42', u'1', u'', u'', u'0.942.5/30.86', u'']

Note that startswith() accepts a tuple of strings to check.
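As a rough Python 3 sketch of the same idea, the function filter can be applied row by row and the results keyed by element id. The inline `HTML` string is a trimmed copy of the row from the question, the names `PREFIXES`, `wanted_id`, and `extract_row` are made up for illustration, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question (illustrative only).
HTML = """
<table>
<tr id="tr_1039509">
  <td><div id="home_1039509"><a href="javascript:Team(19195)">U18()</a></div>
      <div id="guest_1039509"><a href="javascript:Team(11013)">U18</a></div></td>
  <td><div id="gs_1039509" class="score">2</div>
      <div id="time_1039509">42 <img src="images/in.gif"></div>
      <div id="hs_1039509" class="score">1</div></td>
  <td id="odds_3_1039509"><a class="sb">0.94</a><br><a class="pk">2.5/3</a><br>0.86</td>
</tr>
</table>
"""

# Hypothetical prefix list; adjust it to the ids you actually need.
PREFIXES = ("home_", "guest_", "odds_3_", "gs_", "hs_", "time_")

def wanted_id(value):
    """Return True for id values that start with one of PREFIXES."""
    # BeautifulSoup passes None for tags that have no id attribute.
    return value is not None and value.startswith(PREFIXES)

def extract_row(tr):
    """Map each matching element's id to its stripped text."""
    return {el["id"]: el.get_text(strip=True) for el in tr.find_all(id=wanted_id)}

soup = BeautifulSoup(HTML, "html.parser")
rows = [extract_row(tr) for tr in soup.find_all("tr")]
print(rows[0])
```

Keying by id keeps each value attached to the element it came from, which a flat list of strings does not.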

Answer 1 (score: 1)

You can get the cols list like this:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)

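# Parentheses group the alternatives; square brackets would be a character class.
# "td" is included because the odds_3 cell in the question is a <td>.
soup.find_all(["div", "span", "td"], id=re.compile(r'^(home|guest|odds_3|gs|hs|time)_\d+'))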

The regular expression is just an example.
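One detail worth checking in any such pattern: square brackets form a character class (a set of single characters), whereas alternation between whole words needs a parenthesized group. A small sketch with the standard `re` module, using ids from the question; the variable names are illustrative only.

```python
import re

ids = ["tr_1039509", "t_1039509", "home_1039509", "guest_1039509",
       "gs_1039509", "hs_1039509", "time_1039509",
       "odds_1_1039509", "odds_3_1039509", "odds_31_1039509"]

# Character class: matches ONE character from the set, then "_" and digits.
char_class = re.compile(r'[home|guest|odds_3|gs|hs|time]_\d+')
# Group with alternation, anchored at the start of the id.
group = re.compile(r'^(home|guest|odds_3|gs|hs|time)_\d+')

class_hits = [i for i in ids if char_class.search(i)]
group_hits = [i for i in ids if group.search(i)]

print(class_hits)
print(group_hits)
```

Here the character class also accepts `t_1039509`, `odds_1_1039509`, and `odds_31_1039509`, while the grouped pattern keeps only the six ids listed in the question.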

In your case, it could be:

# odds_3 as listed in the question; "td" is included because that cell is a <td>
cols = tr.find_all(["div", "span", "td"], id=re.compile(r'^(home|guest|odds_3|gs|hs|time)_\d+'))

for tag in cols:
    # find(text=True) returns only the first text node, which may be whitespace;
    # find_all(text=True) returns every text node under the tag
    t = tag.find_all(text=True)
    if t:
        # find_all returns a list, so the pieces need to be joined
        text = ''.join(t).strip() + ';'
        print(text)
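When an element holds several text nodes, `''.join(...)` keeps whatever raw whitespace sits between them, and `get_text(strip=True)` runs the stripped pieces together (the odds cell becomes `0.942.5/30.86`). A possible alternative, sketched here under the assumption that a `;` separator between values is acceptable, is to join the tag's `stripped_strings`. The `cell_text` helper and the inline HTML fragment are illustrative only.

```python
from bs4 import BeautifulSoup

# Fragment modelled on the odds cell from the question (illustrative only).
HTML = """
<table><tr id="tr_1039509">
<td id="odds_3_1039509">
  <a class="sb">0.94</a>
  <img src="images/t3.gif">
  <br>
  <a class="pk">2.5/3</a>
  <br>
  0.86
</td>
</tr></table>
"""

def cell_text(tag, sep=";"):
    """Join every non-empty, whitespace-stripped text node under tag."""
    return sep.join(tag.stripped_strings)

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find(id="odds_3_1039509")
print(cell_text(cell))  # 0.94;2.5/3;0.86
```

The same result should be available through `tag.get_text(";", strip=True)`; the helper just makes the separator explicit.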