Extracting the data from all tr tags in a table

Time: 2016-06-13 17:38:37

Tags: python-3.x beautifulsoup find html-table

HTML code:

<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr><th>Name</th><th>Email</th><th>Supervisor</th><th>Room</th><th>Phone</th></tr>
<tr>
<td>Anastasiou, Alexandros</td>
<td><a href="mailto:alexandros.anastasiou07">alexandros.anastasiou07</a></td>
<td>Prof Duff</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Ashmore, Anthony</td>
<td><a href="mailto:a.ashmore12">a.ashmore12</a></td>
<td>Prof Waldram</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Banks, Elliot</td>
<td><a href="mailto:EB713">EB713</a></td>
<td>Prof Gauntlett</td>
<td>512a</td>
<td>47839</td>
</tr>
</tbody>
</table>

That is the HTML code. The 3rd td tag in each tr contains more tags... please help me.

My Python code:

    souphandler=BeautifulSoup(htmltext)

    table=souphandler.find('table')
    tr_tag=table.find('tr')
    try:
        while(tr_tag is not None):
            for row in tr_tag:
                print(row.string)
            tr_tag=tr_tag.findNext('tr')  

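This code prints everything over and over again. I want to extract all the data inside the tr tags.

1 answer:

Answer 0: (score: 0)

You need to find the tr tags, then extract the th tags from the first one and the td tags from the rest: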

h = """
  <table border="0" cellpadding="0" cellspacing="0">
<tr><th>Name</th><th>Email</th><th>Supervisor</th><th>Room</th><th>Phone</th></tr>
<tr>
<td>Anastasiou, Alexandros</td>
<td><a href="mailto:alexandros.anastasiou07">alexandros.anastasiou07</a></td>
<td>Prof Duff</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Ashmore, Anthony</td>
<td><a href="mailto:a.ashmore12">a.ashmore12</a></td>
<td>Prof Waldram</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Banks, Elliot</td>
<td><a href="mailto:EB713">EB713</a></td>
<td>Prof Gauntlett</td>
<td>512a</td>
<td>47839</td>
</tr>
</table>"""


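from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(h, "html.parser")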
table = soup.find("table")
print(",".join([th.text for th in table.find("tr").find_all("th")]))
for tr in table.select("tr + tr"):
    tds = tr.find_all("td")
    print(tds[1].a["href"])
    print(", ".join([td.text for td in tds]))

Which gives you:

Name,Email,Supervisor,Room,Phone
mailto:alexandros.anastasiou07
Anastasiou, Alexandros, alexandros.anastasiou07, Prof Duff, 512b, 47838
mailto:a.ashmore12
Ashmore, Anthony, a.ashmore12, Prof Waldram, 512b, 47838
mailto:EB713
Banks, Elliot, EB713, Prof Gauntlett, 512a, 47839
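
If you would rather keep the rows as data than print them, the same idea can be sketched as below. This is only an illustration: the function name `table_to_records` and the shortened `html` sample are made up for the example, and it assumes the first tr holds the th headers and every other tr holds one td per header.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two data rows only).
html = """
<table border="0" cellpadding="0" cellspacing="0">
<tr><th>Name</th><th>Email</th><th>Supervisor</th><th>Room</th><th>Phone</th></tr>
<tr>
<td>Anastasiou, Alexandros</td>
<td><a href="mailto:alexandros.anastasiou07">alexandros.anastasiou07</a></td>
<td>Prof Duff</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Banks, Elliot</td>
<td><a href="mailto:EB713">EB713</a></td>
<td>Prof Gauntlett</td>
<td>512a</td>
<td>47839</td>
</tr>
</table>
"""


def table_to_records(markup):
    """Return one dict per data row, keyed by the header text (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.find("table").find_all("tr")
    # The first row is assumed to contain the th header cells.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for tr in rows[1:]:
        # get_text() also returns the text of nested tags such as <a>.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != len(headers):
            continue  # skip rows that do not match the header layout
        records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(html)
for record in records:
    print(record)
```

Each row then becomes a dict such as `records[0]["Supervisor"]`, which is easier to filter or write out to CSV than the printed strings.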