How do I extract data from multiple tables using BeautifulSoup?

Time: 2015-01-26 06:33:08

Tags: python html web-scraping beautifulsoup

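I have 28 data tables like the ones below. I extracted the HTML with BeautifulSoup, and now I need to create a CSV file by scraping the data from the cells of these tables. I am new to Python. I tried using BeautifulSoup, but I can't get it to work. How do I loop over the tables to create a CSV?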

<table border="1" cellpadding="2" cellspacing="0" width="600">
          <tr>
           <th colspan="3">
            Chatsworth, Ga
           </th>
           <th colspan="6">
            Forecast NFDRS-88 Valid at 1300 EST Jan 26 2015
           </th>
          </tr>
          <tr>
           <td bgcolor="#C0C0C0" width="50">
            <b>
             <font size="2">
              RH (%)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="60">
            <b>
             <font size="2">
              IC
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="60">
            <b>
             <font size="2">
              BI
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="100">
            <b>
             <font size="2">
              Class Day
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="55">
            <b>
             <font size="2">
              KBDI
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="80">
            <b>
             <font size="2">
              Wind (mph)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="80">
            <b>
             <font size="2">
              Mx_Wind
              <br/>
              (mph)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="40">
            <b>
             <font size="2">
              Rn24
              <br/>
              (inch)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="50">
            <b>
             <font size="2">
              Dur
              <br/>
              (Hr)
             </font>
            </b>
           </td>
          </tr>
          <tr>
           <td align="center">
            <font size="2">
             78
            </font>
           </td>
           <td align="center">
            <font size="2">
             0
            </font>
           </td>
           <td align="center">
            <font size="2">
             12
            </font>
           </td>
           <td align="center">
            <font size="2">
             1
         Low
            </font>
           </td>
           <td align="center">
            <font size="2">
             2
            </font>
           </td>
           <td align="center">
            <font size="2">
             NW            10
            </font>
           </td>
           <td align="center">
            <font size="2">
             NW            17
            </font>
           </td>
           <td align="center">
            <font size="2">
             0.05
            </font>
           </td>
           <td align="center">
            <font size="2">
             0
            </font>
           </td>
          </tr>
          <tr>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Sow
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Temp  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Td  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Tmax  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Tmin  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              RHMax (%)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              RHMin (%)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              HrbGF
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              WdyGF
             </font>
            </b>
           </td>
          </tr>
          <tr>
           <td align="center">
            <font size="2">
             4
            </font>
           </td>
           <td align="center">
            <font size="2">
             41
            </font>
           </td>
           <td align="center">
            <font size="2">
             34
            </font>
           </td>
           <td align="center">
            <font size="2">
             41
            </font>
           </td>
           <td align="center">
            <font size="2">
             38
            </font>
           </td>
           <td align="center">
            <font size="2">
             86
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             70
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             0
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             0
            </font>
           </td>
          </tr>
          <tr>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              1-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              10-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              100-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              1000-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              X1000
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              Herbaceous
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              Woody
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              SC
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              EC
             </font>
            </b>
           </td>
          </tr>
          <tr>
           <td align="center">
            <font size="2">
             20.4
            </font>
           </td>
           <td align="center">
            <font size="2">
             20.4
            </font>
           </td>
           <td align="center">
            <font size="2">
             21.0
            </font>
           </td>
           <td align="center">
            <font size="2">
             24.7
            </font>
           </td>
           <td align="center">
            <font size="2">
             24.7
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             20.4
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             70.0
            </font>
           </td>
           <td align="center">
            <font size="2">
             7
            </font>
           </td>
           <td align="center">
            <font size="2">
             3
            </font>
           </td>
          </tr>
         </table>
         <p>
          <a name="#Dallas">
          </a>
         </p>
         <table border="1" cellpadding="2" cellspacing="0" width="600">
          <tr>
           <th colspan="3">
            Dallas, Ga
           </th>
           <th colspan="6">
            Forecast NFDRS-88 Valid at 1300 EST Jan 26 2015
           </th>
          </tr>
          <tr>
           <td bgcolor="#C0C0C0" width="50">
            <b>
             <font size="2">
              RH (%)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="60">
            <b>
             <font size="2">
              IC
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="60">
            <b>
             <font size="2">
              BI
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="100">
            <b>
             <font size="2">
              Class Day
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="55">
            <b>
             <font size="2">
              KBDI
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="80">
            <b>
             <font size="2">
              Wind (mph)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="80">
            <b>
             <font size="2">
              Mx_Wind
              <br/>
              (mph)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="40">
            <b>
             <font size="2">
              Rn24
              <br/>
              (inch)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" width="50">
            <b>
             <font size="2">
              Dur
              <br/>
              (Hr)
             </font>
            </b>
           </td>
          </tr>
          <tr>
           <td align="center">
            <font size="2">
             57
            </font>
           </td>
           <td align="center">
            <font size="2">
             3
            </font>
           </td>
           <td align="center">
            <font size="2">
             17
            </font>
           </td>
           <td align="center">
            <font size="2">
             2
         Moderate
            </font>
           </td>
           <td align="center">
            <font size="2">
             3
            </font>
           </td>
           <td align="center">
            <font size="2">
             N              7
            </font>
           </td>
           <td align="center">
            <font size="2">
             N             12
            </font>
           </td>
           <td align="center">
            <font size="2">
             0.00
            </font>
           </td>
           <td align="center">
            <font size="2">
             0
            </font>
           </td>
          </tr>
          <tr>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Sow
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Temp  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Td  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Tmax  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              Tmin  (°F)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              RHMax (%)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              RHMin (%)
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              HrbGF
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              WdyGF
             </font>
            </b>
           </td>
          </tr>
          <tr>
           <td align="center">
            <font size="2">
             4
            </font>
           </td>
           <td align="center">
            <font size="2">
             45
            </font>
           </td>
           <td align="center">
            <font size="2">
             30
            </font>
           </td>
           <td align="center">
            <font size="2">
             46
            </font>
           </td>
           <td align="center">
            <font size="2">
             38
            </font>
           </td>
           <td align="center">
            <font size="2">
             82
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             57
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             0
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             0
            </font>
           </td>
          </tr>
          <tr>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              1-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              10-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              100-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              1000-Hour
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              X1000
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              Herbaceous
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0" colspan="1">
            <b>
             <font size="2">
              Woody
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              SC
             </font>
            </b>
           </td>
           <td bgcolor="#C0C0C0">
            <b>
             <font size="2">
              EC
             </font>
            </b>
           </td>
          </tr>
          <tr>
           <td align="center">
            <font size="2">
             13.6
            </font>
           </td>
           <td align="center">
            <font size="2">
             13.6
            </font>
           </td>
           <td align="center">
            <font size="2">
             18.0
            </font>
           </td>
           <td align="center">
            <font size="2">
             20.2
            </font>
           </td>
           <td align="center">
            <font size="2">
             20.2
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             13.6
            </font>
           </td>
           <td align="center" colspan="1">
            <font size="2">
             70.0
            </font>
           </td>
           <td align="center">
            <font size="2">
             4
            </font>
           </td>
           <td align="center">
            <font size="2">
             10
            </font>
           </td>
          </tr>
         </table>

2 answers:

Answer 0 (score: 0)

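    import csv
    from bs4 import BeautifulSoup  # BeautifulSoup 4 (the BeautifulSoup 3 import is obsolete)

    html = '''
    PASTE YOUR HTML HERE
    '''
    soup = BeautifulSoup(html, 'html.parser')

    with open('test.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)  # handles quoting and escaping of each field
        for table in soup.find_all('table'):
            for row in table.find_all('tr'):
                fields = []
                for cell in row.find_all(['th', 'td']):
                    # Collapse the line breaks and runs of spaces inside each cell.
                    fields.append(' '.join(cell.get_text().split()))
                    # Pad with empty fields so a cell with colspan=N fills N columns.
                    fields.extend([''] * (int(cell.get('colspan', 1)) - 1))
                if fields:
                    writer.writerow(fields)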
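
To make the colspan padding in this answer easier to see, here is a minimal sketch of the per-row step on its own. It assumes BeautifulSoup 4 is installed as `bs4`; the helper name `row_to_fields` and the shortened sample HTML are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def row_to_fields(row):
    """Return the cell texts of one <tr>, padded so colspan=N takes N fields."""
    fields = []
    for cell in row.find_all(['th', 'td']):
        # Collapse runs of whitespace (including newlines) into single spaces.
        fields.append(' '.join(cell.get_text().split()))
        # A cell spanning N columns gets N - 1 empty fields after it.
        fields.extend([''] * (int(cell.get('colspan', 1)) - 1))
    return fields


# Hypothetical, shortened version of one table from the question.
sample = '''
<table>
 <tr><th colspan="3">Chatsworth, Ga</th><th colspan="2">Forecast</th></tr>
 <tr><td>78</td><td>0</td><td>12</td><td>1
    Low</td><td>2</td></tr>
</table>
'''
rows = BeautifulSoup(sample, 'html.parser').find_all('tr')
header_fields = row_to_fields(rows[0])
value_fields = row_to_fields(rows[1])
print(header_fields)
print(value_fields)
```

Each returned list can be passed straight to `csv.writer(...).writerow(...)`, so the title row and the data rows line up in the same columns.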

Answer 1 (score: 0)

You can loop over the tables and the tags inside each table like this:

    from bs4 import BeautifulSoup

    html = '''
    PASTE YOUR HTML HERE
    '''
    soup = BeautifulSoup(html, 'html.parser')

    for tbl in soup.find_all('table'):
        for td in tbl.find_all('td'):
            # do things with td
            print(td.text.strip())
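
The loop above visits every cell, but the tables in the question alternate between a row of labels and a row of values. If one CSV row per station is wanted, a possible sketch is to pair each label row with the value row that follows it. This assumes BeautifulSoup 4 (`bs4`), that every table has the same layout and the same labels, and that labels are unique within a table; the names `clean` and `table_to_record` and the shortened sample HTML are hypothetical.

```python
import csv
import io

from bs4 import BeautifulSoup


def clean(tag):
    """Collapse all whitespace inside a tag's text into single spaces."""
    return ' '.join(tag.get_text(' ').split())


def table_to_record(table):
    """Flatten one station table into a single dict of label -> value."""
    rows = table.find_all('tr')
    titles = [clean(th) for th in rows[0].find_all('th')]
    record = {'Station': titles[0], 'Forecast': titles[1]}
    # Assumption: after the title row, rows alternate label row / value row.
    for label_row, value_row in zip(rows[1::2], rows[2::2]):
        labels = [clean(td) for td in label_row.find_all('td')]
        values = [clean(td) for td in value_row.find_all('td')]
        record.update(zip(labels, values))
    return record


# Hypothetical, shortened version of the two tables from the question.
sample = '''
<table>
 <tr><th colspan="3">Chatsworth, Ga</th><th colspan="6">Forecast NFDRS-88</th></tr>
 <tr><td><b>RH (%)</b></td><td><b>Mx_Wind<br/>(mph)</b></td></tr>
 <tr><td>78</td><td>NW            17</td></tr>
 <tr><td><b>Sow</b></td><td><b>RHMax (%)</b></td></tr>
 <tr><td>4</td><td>86</td></tr>
</table>
<table>
 <tr><th colspan="3">Dallas, Ga</th><th colspan="6">Forecast NFDRS-88</th></tr>
 <tr><td><b>RH (%)</b></td><td><b>Mx_Wind<br/>(mph)</b></td></tr>
 <tr><td>57</td><td>N             12</td></tr>
 <tr><td><b>Sow</b></td><td><b>RHMax (%)</b></td></tr>
 <tr><td>4</td><td>82</td></tr>
</table>
'''
soup = BeautifulSoup(sample, 'html.parser')
records = [table_to_record(t) for t in soup.find_all('table')]

# Written to an in-memory buffer here; use open('out.csv', 'w', newline='')
# instead to produce a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

`csv.DictWriter` raises `ValueError` if a later table contains a label that the first one lacks, which is a useful early warning when one of the 28 tables is laid out differently.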