How do I parse this HTML with BeautifulSoup4?

Asked: 2015-02-25 19:08:28

Tags: python parsing beautifulsoup

I want to get the date and status (under "Fecha" and "Estado"). There may be more td tags in that table.

URL with the HTML to parse

<body link="#000000" vlink="#000000" alink="#000000" leftmargin="15" topmargin="0" marginwidth="0" marginheight="0" bgcolor="#FFFFFF">
<table cellspacing=0 cellpadding=0 border=0 style="width: 399px">
<tr>
    <td valign=top align=left>
        <TABLE border=0 cellPadding=0 cellSpacing=0 style="width: 403px">        
            <tr>
                <td colSpan=2><IMG  src="img/segpaqueteria_2013.jpg" ></td>
            </tr>

        </TABLE>
        <table border="0" cellspacing="0" cellpadding="0"  width=395>

        <TR bgColor=#f9f4ed height=20>
            <TD colspan=3 height=23 class=down>
            <TABLE border=0 cellPadding=0 cellSpacing=0 width=395>
                <TR bgColor=#f9e9d5 height=20>
                    <TD height=23 colspan = 4 class=down>&nbsp;&nbsp;<IMG height=10 src="img/bullet.gif" width=15>&nbsp;<font face=verdana size="1"><B>Envío Nro:</B>&nbsp;&nbsp;4463400000000000255</font></TD>                    
                </TR>
                <TR bgColor=#f9f4ed height = 15>
                    <td bgColor=#f9e9d5 width = 1></td>
                    <TD class=texto ><font face=verdana size="1"><B>&nbsp;&nbsp;Remito Nro.:</B>&nbsp;&nbsp;</font></TD>
                    <td bgColor=white width = 1></td>
                    <TD class=texto><font face=verdana size="1"><B>&nbsp;&nbsp;Paquetes:</B>&nbsp;&nbsp;1</font></TD>
                </TR>
                <TR height = 15>
                    <td bgColor=#f9e9d5 width = 1></td>
                    <TD bgColor=#f9e9d5 class=texto><font face=verdana size="1"><B>&nbsp;&nbsp;Retiro</B></font></TD>
                    <td bgColor=white width = 1></td>
                    <TD bgColor=#f9e9d5 class=texto><font face=verdana size="1"><B>&nbsp;&nbsp;Entrega</B></font></TD>
                </TR>
                <TR height = 15>
                    <td bgColor=#f9f4ed width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;AG RUSH SRL </font></td>
                    <td bgColor=white width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;S/D S/D</font></td>
                </TR>
                <TR height = 15>
                    <td bgColor=#f9f4ed width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;ARENAL CONCEPCION 3425 - 43</font></td>
                    <td bgColor=white width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;SANTIAGO 380              </font></td>
                </TR>
                <TR bgColor=#f9f4ed height = 15>
                    <td bgColor=#f9f4ed width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;Capital Federal</font></td>
                    <td bgColor=white width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;ROSARIO</font></td>
                </TR>
                <TR bgColor=#f9f4ed height = 15>
                    <td bgColor=#f9f4ed width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;1427 - CAPITAL FEDERAL               </font></td>
                    <td bgColor=white width = 1></td>
                    <TD bgColor=#f9f4ed class=texto><font face=verdana size="1">&nbsp;2000     - SANTA FE                      </font></td>
                </TR>
            </TABLE>
            </TD>                   
        </TR>
        <TR bgColor=#f9e9d5 height=20>
            <TD height=20 class=texto><font face=verdana size="1">&nbsp;&nbsp;<B>Fecha</B></font></TD>
            <td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
            <TD height=20 class=texto><font face=verdana size="1">&nbsp;&nbsp;<B>Estado</B></font></TD>
        </TR>

                <TR bgColor=#f9f4ed height=20>


                        <TD height=20 class=texto><font face=verdana size="1">24/2/2015&nbsp;&nbsp;</font></TD>

                    <td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
                    <TD height=20 class=texto><font face=verdana size="1">&nbsp;&nbsp;En Tránsito - Planta Velez Sarfield         </font></TD>                  
                </TR>

                <TR bgColor=#f9e9d5 height=20>


                        <TD height=20 class=texto><font face=verdana size="1">24/2/2015&nbsp;&nbsp;</font></TD>

                    <td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
                    <TD height=20 class=texto><font face=verdana size="1">&nbsp;&nbsp;Despachado a Sucursal de Destino - Planta Velez Sarfield         </font></TD>                 
                </TR>

                <TR bgColor=#f9f4ed height=20>


                        <TD height=20 class=texto><font face=verdana size="1">25/2/2015&nbsp;&nbsp;</font></TD>

                    <td width=1 bgcolor=#ffffff height=20><SPACER type="block"></td>
                    <TD height=20 class=texto><font face=verdana size="1">&nbsp;&nbsp;En Tránsito a Suc. de Destino - ROSARIO                       </font></TD>                    
                </TR>

        </table>
        <br>
        <center><a href="#" onclick="javascript:history.back(1)">
        <img src="img/ocaexprespak_volver.gif" border=0></a></center>
    </td>
</tr>
</table>
</div>
</body>

Example result

  

24/2/2015 - En Tránsito - Planta Velez Sarfield

24/2/2015 - Despachado a Sucursal de Destino - Planta Velez Sarfield

25/2/2015 - En Tránsito a Suc. de Destino - ROSARIO

What I have done so far:

import requests
import bs4

URL = 'https://www1.oca.com.ar/OEPTrackingWeb/detalleenviore.asp?numero=4463400000000000255'

r = requests.get(URL, headers={'User-Agent': 'User-Agent'})
s = bs4.BeautifulSoup(r.text)

print(s.body.table.table.next_sibling.next_sibling)

1 answer:

Answer 0 (score: 2)

I'd look for the column label, then work from there:

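import re

# Locate the <b> element holding the "Fecha" column label (case-insensitive).
header = s.find('b', text=re.compile('fecha', flags=re.I))
# Climb to the header's enclosing table row.
parent_row = header.find_parent('tr')
# Each following sibling row holds one date/status pair.
for row in parent_row.find_next_siblings('tr'):
    # Only the text cells carry the "texto" class; the spacer cells do not.
    cells = row.find_all('td', class_='texto')
    date, entry = (c.get_text(strip=True) for c in cells)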

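Having found the header, the code moves up to the nearest enclosing <tr> row and iterates over all the table rows that follow it. The cells that contain text have a helpful texto class; calling Element.get_text() on those (with strip=True to remove the extra whitespace in these cells) gives us the information you need.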
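The same idea can be tried offline. The following is a minimal sketch, not part of the original answer: SAMPLE_HTML is a trimmed-down stand-in for the page (the last row is a hypothetical extra row, since the question notes the table may contain more td tags), extract_tracking_rows is an illustrative helper name, and the sketch assumes the stdlib html.parser backend and the string= keyword (the newer spelling of text= in bs4). It also skips any row that does not hold exactly two texto cells instead of failing on the tuple unpacking.

```python
import re

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the tracking page, not the full markup.
# The final row is a hypothetical extra row with a single "texto" cell.
SAMPLE_HTML = """
<table>
  <TR><TD class=texto><font><B>&nbsp;&nbsp;Remito Nro.:</B></font></TD></TR>
  <TR>
    <TD class=texto><font>&nbsp;&nbsp;<B>Fecha</B></font></TD>
    <td width=1></td>
    <TD class=texto><font>&nbsp;&nbsp;<B>Estado</B></font></TD>
  </TR>
  <TR>
    <TD class=texto><font>24/2/2015&nbsp;&nbsp;</font></TD>
    <td width=1></td>
    <TD class=texto><font>&nbsp;&nbsp;En Tránsito - Planta Velez Sarfield   </font></TD>
  </TR>
  <TR>
    <TD class=texto><font>25/2/2015&nbsp;&nbsp;</font></TD>
    <td width=1></td>
    <TD class=texto><font>&nbsp;&nbsp;En Tránsito a Suc. de Destino - ROSARIO   </font></TD>
  </TR>
  <TR><TD class=texto colspan=3><font>nota</font></TD></TR>
</table>
"""


def extract_tracking_rows(html):
    """Return (date, status) pairs from the rows below the "Fecha" header."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the <b> tag whose text matches "fecha", ignoring case.
    header = soup.find("b", string=re.compile("fecha", flags=re.I))
    if header is None:
        return []
    rows = []
    for row in header.find_parent("tr").find_next_siblings("tr"):
        cells = row.find_all("td", class_="texto")
        # Skip rows that are not a date/status pair.
        if len(cells) != 2:
            continue
        date, status = (c.get_text(strip=True) for c in cells)
        rows.append((date, status))
    return rows


for date, status in extract_tracking_rows(SAMPLE_HTML):
    print(date, "-", status)
```

Returning an empty list when the header is missing is a design choice for this sketch; raising an exception would be equally reasonable if a missing table should be treated as an error.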

For your example URL, the snippet at the top of this answer produces:

>>> import requests
>>> import bs4
>>> import re
>>> URL = 'https://www1.oca.com.ar/OEPTrackingWeb/detalleenviore.asp?numero=4463400000000000255'
>>> r = requests.get(URL, headers={'User-Agent': 'User-Agent'})
>>> s = bs4.BeautifulSoup(r.text)
>>> header = s.find('b', text=re.compile('fecha', flags=re.I))
>>> parent_row = header.find_parent('tr')
>>> for row in parent_row.find_next_siblings('tr'):
...     cells = row.find_all('td', class_='texto')
...     date, entry = (c.get_text(strip=True) for c in cells)
...     print(date, entry)
... 
24/2/2015 En Tránsito - Planta Velez Sarfield
24/2/2015 Despachado a Sucursal de Destino - Planta Velez Sarfield
25/2/2015 En Tránsito a Suc. de Destino - ROSARIO
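
If the "date - status" lines from the question are the goal, or the dates need to be compared, the extracted strings can be post-processed with the standard library. This is only a sketch under stated assumptions: the rows list is copied by hand from the output above, parse_row and format_line are illustrative names, and the dates are assumed to be day/month/year.

```python
from datetime import datetime

# Pairs copied from the session output above.
rows = [
    ("24/2/2015", "En Tránsito - Planta Velez Sarfield"),
    ("24/2/2015", "Despachado a Sucursal de Destino - Planta Velez Sarfield"),
    ("25/2/2015", "En Tránsito a Suc. de Destino - ROSARIO"),
]


def parse_row(date_text, status):
    """Turn the date string into a datetime.date (assumes day/month/year)."""
    return datetime.strptime(date_text, "%d/%m/%Y").date(), status


def format_line(date_text, status):
    """Build a line in the 'date - status' shape asked for in the question."""
    return f"{date_text} - {status}"


events = [parse_row(d, s) for d, s in rows]
# The most recent event, compared by date only.
latest_date, latest_status = max(events, key=lambda event: event[0])

for date_text, status in rows:
    print(format_line(date_text, status))
print("Latest:", latest_date.isoformat(), latest_status)
```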