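Beautiful Soup scrape for "Worldwide"

Time: 2016-06-18 08:39:06

Tags: python beautifulsoup scrape

I'm trying to use Beautiful Soup to scrape worldwide box-office totals from some Box Office Mojo pages. My code below grabs the domestic number just fine, but it won't work when I substitute "Worldwide" for "Domestic Total Gross". Maybe that's because "Worldwide" appears more than once on the page.

Any help fixing it? I'll also paste the two relevant parts of the page source. Thanks!

Source code below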

... (skipped) ...

<center><table border="0" border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcdc" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$172,825,435</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=mgm.htm">MGM</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&release=theatrical&date=1988-12-16&p=.htm">December 16, 1988</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Drama</b></td><td valign="top">Runtime: <b>2 hrs. 13 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>R</b></td><td valign="top">Production Budget: <b>$25 million</b></td></tr></table>  </td>

Python code below

<tr>
<td width="40%">=&nbsp;<b>Worldwide:</b></td>
<td width="35%" align="right">&nbsp;<b>$354,825,435</b></td>
<td width="25%">&nbsp;</td>
</tr>

1 answer:

Answer 0 (score: 0)

Use the table inside the div.mp_box element to get what you want:

In [1]: from bs4 import BeautifulSoup
In [2]: import requests
In [3]: r = requests.get("http://www.boxofficemojo.com/movies/?id=rainman.htm").content

In [4]: soup = BeautifulSoup(r,"lxml")

In [5]: table = soup.select_one("div.mp_box table")

In [6]: print(table)
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$172,825,435</b></td>
<td align="right" width="25%">   <b>48.7%</b></td>
</tr>
<tr>
<td width="40%">+ <a href="/movies/?page=intl&amp;id=rainman.htm">Foreign:</a></td>
<td align="right" width="35%"> $182,000,000</td>
<td align="right" width="25%">   51.3%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
</table>

In [7]: rows = table.select("tr")

In [8]: rows[0].select_one("td + td").text
Out[8]: u'\xa0$172,825,435'

In [9]: rows[1].select_one("td + td").text
Out[9]: u'\xa0$182,000,000'

In [10]: rows[-1].select_one("td + td").text
Out[10]: u'\xa0$354,825,435'
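
Each value comes back with a leading non-breaking space (\xa0) plus the dollar sign and thousands separators. If you need numbers rather than strings, here is a minimal sketch, assuming every figure is formatted like $172,825,435 (the helper name parse_gross is made up for illustration and is not part of Beautiful Soup):

```python
def parse_gross(text):
    # Remove the non-breaking space and any ordinary surrounding whitespace.
    cleaned = text.replace(u"\xa0", "").strip()
    # Drop the dollar sign and the thousands separators, then convert.
    return int(cleaned.lstrip("$").replace(",", ""))


print(parse_gross(u"\xa0$172,825,435"))
print(parse_gross(u"\xa0$354,825,435"))
```

This would raise ValueError on anything that is not a plain dollar amount (for example "n/a"), which is probably what you want while debugging a scraper.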

Or match on the label text instead of specifying the row:

In [27]: soup = BeautifulSoup(r,"lxml")

In [28]: table = soup.select_one("div.mp_box table")

In [29]: print(table.find("b",  text="Domestic:").find_next("td").text)
 $172,825,435

In [30]: print(table.find("b",  text="Worldwide:").find_next("td").text)
 $354,825,435

In [31]: print(table.find("a",  text="Foreign:").find_next("td").text)
 $182,000,000
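
If you want all three figures at once, the same idea can be wrapped in a small function. This is only a sketch: the function name grosses and the trimmed HTML string are my own, standing in for the live page so it runs offline, and it uses the built-in "html.parser" instead of lxml. Because the lookup is restricted to the div.mp_box table, any other occurrence of "Worldwide" elsewhere on the page is never considered.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the div.mp_box table printed above (assumed page structure).
HTML = """
<div class="mp_box"><table>
<tr><td><b>Domestic:</b></td><td align="right">&nbsp;<b>$172,825,435</b></td><td>48.7%</td></tr>
<tr><td>+ <a href="/movies/?page=intl&amp;id=rainman.htm">Foreign:</a></td><td align="right">&nbsp;$182,000,000</td><td>51.3%</td></tr>
<tr><td colspan="3"><hr/></td></tr>
<tr><td>= <b>Worldwide:</b></td><td align="right">&nbsp;<b>$354,825,435</b></td><td>&nbsp;</td></tr>
</table></div>
"""


def grosses(soup):
    """Map each row label (e.g. 'Worldwide') to its dollar figure as text."""
    table = soup.select_one("div.mp_box table")
    result = {}
    for row in table.select("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # skip the <hr/> separator row
        # Labels look like '= Worldwide:' or '+ Foreign:'; keep only the name.
        label = cells[0].get_text().strip(u"=+:\xa0 ")
        result[label] = cells[1].get_text().strip()
    return result


totals = grosses(BeautifulSoup(HTML, "html.parser"))
print(totals["Worldwide"])
```

For the real page you would pass BeautifulSoup(r, "lxml") from the session above instead of the inline string.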