Python Beautiful Soup question

Time: 2013-09-06 14:45:03

Tags: python html parsing url beautifulsoup

Suppose my page content contains HTML with meta tags, such as

<meta property="og:country-name" content="South Africa"/>

The problem is that I need to extract the country name from the HTML markup of the whole page.

import urllib2
from bs4 import BeautifulSoup as BS

url = "http://mydomain.com"  # urlopen needs a scheme such as http://
usock = urllib2.urlopen(url)
data = usock.read()  # raw HTML of the page
usock.close()
soup = BS(data)
print soup.findAll(...

I can't figure out what the next step should be. Any suggestions?

1 Answer:

Answer 0: (score: 2)

Search for the <meta> tag that has the specific attributes:

country_meta = soup.find('meta', attrs={'property': 'og:country-name', 'content': True})
if country_meta:
    country = country_meta['content']
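
The same lookup can be wrapped in a small helper for Python 3. This is only a sketch under a few assumptions: the function name `get_country_name` is hypothetical (it is not part of the original answer), and the built-in `"html.parser"` is named explicitly so the result does not depend on which optional parser is installed.

```python
from bs4 import BeautifulSoup


def get_country_name(html):
    """Return the og:country-name value from an HTML string, or None.

    Hypothetical helper for illustration only.
    """
    # Name the parser explicitly; "html.parser" ships with Python.
    soup = BeautifulSoup(html, "html.parser")
    # content=True matches only tags that actually carry a content attribute.
    meta = soup.find("meta", attrs={"property": "og:country-name", "content": True})
    return meta["content"] if meta is not None else None


page = '<html><head><meta property="og:country-name" content="South Africa"/></head></html>'
print(get_country_name(page))
```

Returning None when the tag is missing avoids the undefined `country` variable that the bare `if` leaves behind. In Python 3 the page bytes would typically come from `urllib.request.urlopen(url).read()` rather than `urllib2`.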

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><head>
...     <meta property="og:country-name" content="South Africa"/>
...     <title>Foo</title>
... </head><body></body></html>''')
>>> country_meta = soup.find('meta', attrs={'property': 'og:country-name', 'content': True})
>>> country_meta
<meta content="South Africa" property="og:country-name"/>
>>> print country_meta['content']
South Africa
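
As a possible alternative to `find()`, the same tag can probably be located with a CSS attribute selector. This sketch assumes Python 3 and a Beautiful Soup version that provides `select_one`. The sample markup is the one from the demo above.

```python
from bs4 import BeautifulSoup

html = """<html><head>
    <meta property="og:country-name" content="South Africa"/>
    <title>Foo</title>
</head><body></body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Attribute selector: the first <meta> whose property equals og:country-name.
country_meta = soup.select_one('meta[property="og:country-name"]')
# .get() returns None instead of raising if the content attribute is absent.
country = country_meta.get("content") if country_meta is not None else None
print(country)
```

Whether to use `find()` with `attrs` or a selector is mostly a matter of taste. The selector form is shorter, while `attrs` makes the `content` requirement explicit.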