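I'm trying to use Beautiful Soup to scrape worldwide gross totals from some Box Office Mojo pages. My code below grabs the domestic number just fine, but it doesn't work when I swap "Worldwide" in for "Domestic Total Gross". Maybe that's because "Worldwide" appears more than once on the page.
Any help fixing it? I'll also paste the two relevant parts of the page source. Thanks!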
Page source below:
... (skipped) ...
<center><table border="0" border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcdc" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$172,825,435</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=mgm.htm">MGM</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&release=theatrical&date=1988-12-16&p=.htm">December 16, 1988</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Drama</b></td><td valign="top">Runtime: <b>2 hrs. 13 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>R</b></td><td valign="top">Production Budget: <b>$25 million</b></td></tr></table> </td>
Second part of the page source (the worldwide row):
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td width="35%" align="right"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
Answer 0 (score: 0)
Use the table inside the div.mp_box structure to get what you want:
In [1]: from bs4 import BeautifulSoup
In [2]: import requests
In [3]: r = requests.get("http://www.boxofficemojo.com/movies/?id=rainman.htm").content
In [4]: soup = BeautifulSoup(r,"lxml")
In [5]: table = soup.select_one("div.mp_box table")
In [6]: print(table)
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$172,825,435</b></td>
<td align="right" width="25%"> <b>48.7%</b></td>
</tr>
<tr>
<td width="40%">+ <a href="/movies/?page=intl&id=rainman.htm">Foreign:</a></td>
<td align="right" width="35%"> $182,000,000</td>
<td align="right" width="25%"> 51.3%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
</table>
In [7]: rows = table.select("tr")
In [8]: rows[0].select_one("td + td").text
Out[8]: u'\xa0$172,825,435'
In [9]: rows[1].select_one("td + td").text
Out[9]: u'\xa0$182,000,000'
In [10]: rows[-1].select_one("td + td").text
Out[10]: u'\xa0$354,825,435'
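The strings returned above still carry a leading non-breaking space (`\xa0`) plus the `$` and comma formatting. If you need numbers, something like the following sketch may help. It assumes every figure looks like the three shown here, and `parse_gross` is a hypothetical helper name, not part of Beautiful Soup:

```python
def parse_gross(cell_text):
    # Turn cell text such as "\xa0$172,825,435" into an int number of dollars.
    # str.strip() with no arguments also removes the non-breaking space.
    cleaned = cell_text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


# The three values taken from the session above.
domestic = parse_gross(u"\xa0$172,825,435")
foreign = parse_gross(u"\xa0$182,000,000")
worldwide = parse_gross(u"\xa0$354,825,435")

# Sanity check: domestic plus foreign should equal the worldwide total.
print(domestic + foreign == worldwide)
```

A cell holding anything other than a plain dollar amount would raise `ValueError` here, so you may want to catch that if the pages vary.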
Using the text rather than specifying rows:
In [27]: soup = BeautifulSoup(r,"lxml")
In [28]: table = soup.select_one("div.mp_box table")
In [29]: print(table.find("b", text="Domestic:").find_next("td").text)
$172,825,435
In [30]: print(table.find("b", text="Worldwide:").find_next("td").text)
$354,825,435
In [31]: print(table.find("a", text="Foreign:").find_next("td").text)
$182,000,000
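The label lookup can also be wrapped in a small function and tried offline. The sketch below makes a few assumptions: it parses a copy of the table markup printed earlier from a string instead of fetching the page, it uses the built-in `html.parser` rather than `lxml`, and it passes `string=`, which is the newer spelling of the `text=` argument in Beautiful Soup 4.4 and later. `gross_for` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Copy of the div.mp_box table shown earlier, so no network access is needed.
HTML = """
<div class="mp_box"><table border="0" cellpadding="0" cellspacing="0">
<tr><td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%">&nbsp;<b>$172,825,435</b></td>
<td align="right" width="25%">&nbsp;<b>48.7%</b></td></tr>
<tr><td width="40%">+ <a href="/movies/?page=intl&amp;id=rainman.htm">Foreign:</a></td>
<td align="right" width="35%">&nbsp;$182,000,000</td>
<td align="right" width="25%">&nbsp;51.3%</td></tr>
<tr><td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%">&nbsp;<b>$354,825,435</b></td>
<td width="25%">&nbsp;</td></tr>
</table></div>
"""


def gross_for(table, label):
    # Find the <b> or <a> tag whose text is exactly the label,
    # then read the next <td>, which holds the dollar figure.
    tag = table.find(["b", "a"], string=label)
    if tag is None:
        return None
    # strip=True drops the non-breaking space in front of the amount.
    return tag.find_next("td").get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
# Searching only inside this table avoids other "Worldwide" matches on the page.
table = soup.select_one("div.mp_box table")

for label in ("Domestic:", "Foreign:", "Worldwide:"):
    print(label, gross_for(table, label))
```

Returning `None` for a missing label lets the caller decide how to handle pages that lack a foreign or worldwide row.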