Python re.compile and Beautiful Soup

Time: 2011-11-27 20:40:10

Tags: python regex beautifulsoup

desc = re.compile('<ul class="descShort bullet">(.*)</ul>', re.DOTALL)
findDesc = re.findall(desc, link_source)

for i in findDesc:
    print i


'''
<ul class="descShort bullet">

      Sleek and distinctive, these eye-catching ornaments will be the star of your holiday decor. These unique glass icicle ornaments are individually handcrafted by artisans in India.

  </ul>
'''

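I am trying to extract the description between the <ul class="descShort bullet"> tag and the closing </ul>. I am looking for a solution using regex as well as one using BeautifulSoup.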

1 answer:

Answer 0 (score: 1)

First, parsing HTML/XML with regular expressions is generally considered a bad idea, so using a parser such as BeautifulSoup really is the better choice.
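To illustrate one way the regex approach goes wrong, here is a minimal sketch, assuming Python 3 and a made-up `html` sample (not the asker's real page). It reuses the pattern from the question and shows that the greedy `(.*)` combined with `re.DOTALL` swallows everything between the first opening tag and the last `</ul>`.

```python
import re

# Hypothetical sample: two matching lists with an unrelated link between them.
html = (
    '<ul class="descShort bullet">text1</ul>\n'
    '<a href="example.com">test</a>\n'
    '<ul class="descShort bullet">one more</ul>\n'
)

# Same pattern as in the question: greedy (.*) with re.DOTALL.
greedy = re.compile('<ul class="descShort bullet">(.*)</ul>', re.DOTALL)
greedy_matches = greedy.findall(html)

# A single match comes back, and it still contains markup from the
# link and from the second list's opening tag.
print(len(greedy_matches))
print(greedy_matches[0])
```

With only one such list on the page the pattern appears to work, which is why this kind of bug tends to surface later.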

What you want is something like the following:

from BeautifulSoup import BeautifulSoup

text = """
<ul class="descShort bullet">text1</ul>
<a href="example.com">test</a>
<ul class="descShort bullet">one more</ul>
<ul class="other">text2</ul>
"""

soup = BeautifulSoup(text)

# to get the contents of all <ul> tags:
for tag in soup.findAll('ul'):
    print tag.contents[0]

# to get the contents of <ul> tags w/ attribute class="descShort bullet":
for tag in soup.findAll('ul', {'class': 'descShort bullet'}):
    print tag.contents[0]
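
For the regex half of the question, the following is a rough sketch rather than a recommendation. It assumes Python 3, and the names `DESC_RE`, `extract_descriptions`, and `sample` are made up for illustration. It switches to a non-greedy `(.*?)` so that each match stops at the nearest `</ul>`, and strips the surrounding whitespace seen in the asker's markup.

```python
import re

# Non-greedy (.*?) stops at the first </ul> after each opening tag.
DESC_RE = re.compile(r'<ul class="descShort bullet">(.*?)</ul>', re.DOTALL)


def extract_descriptions(html):
    """Return the stripped text of every <ul class="descShort bullet"> block."""
    return [match.strip() for match in DESC_RE.findall(html)]


# Hypothetical sample modelled on the snippets in the question and answer.
sample = """
<ul class="descShort bullet">

      Sleek and distinctive ornaments.

  </ul>
<ul class="other">text2</ul>
<ul class="descShort bullet">one more</ul>
"""

descriptions = extract_descriptions(sample)
for description in descriptions:
    print(description)
```

This still depends on the exact attribute text and quoting, and it breaks on nested <ul> elements, so the BeautifulSoup version above remains the more robust choice.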