Question

import re

# link_source is assumed to hold the page's HTML as a string
desc = re.compile('<ul class="descShort bullet">(.*)</ul>', re.DOTALL)
findDesc = re.findall(desc, link_source)

for i in findDesc:
    print(i)


'''
<ul class="descShort bullet">

      Sleek and distinctive, these eye-catching ornaments will be the star of your holiday decor. These unique glass icicle ornaments are individually handcrafted by artisans in India.

  </ul>
'''

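I am trying to extract the description between the opening <ul class="descShort bullet"> tag and the closing </ul>. I am looking for a solution that uses a regex as well as one that uses BeautifulSoup.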

Answer 1

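First, parsing HTML/XML with regular expressions is generally considered a bad idea, so using a parser such as BeautifulSoup really is the better choice.

What you want is something like this: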

# BeautifulSoup 4 (the bs4 package); BeautifulSoup 3 used "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup

text = """
<ul class="descShort bullet">text1</ul>
<a href="example.com">test</a>
<ul class="descShort bullet">one more</ul>
<ul class="other">text2</ul>
"""

# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, 'html.parser')

# to get the contents of all <ul> tags:
for tag in soup.find_all('ul'):
    print(tag.contents[0])

# to get the contents of <ul> tags w/ attribute class="descShort bullet":
for tag in soup.find_all('ul', {'class': 'descShort bullet'}):
    print(tag.contents[0])
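
If a regex is still wanted for the simple case in the question, here is a minimal sketch. It assumes the <ul> elements are not nested and that the opening tag appears exactly as written. The sample link_source string and the descriptions name are made up for illustration. The only change to the question's pattern is (.*?) instead of (.*). The greedy form runs on to the last </ul> on the page, while the non-greedy form stops at the first one.

```python
import re

# Made-up sample standing in for the page source from the question.
link_source = """
<ul class="descShort bullet">

      Sleek and distinctive ornaments.

  </ul>
<ul class="other">unrelated</ul>
"""

# (.*?) is non-greedy: each match ends at the first </ul> after the opening tag.
# re.DOTALL lets "." match newlines, since the description spans several lines.
desc = re.compile(r'<ul class="descShort bullet">(.*?)</ul>', re.DOTALL)

# strip() removes the blank lines and indentation around each description.
descriptions = [match.strip() for match in desc.findall(link_source)]

for description in descriptions:
    print(description)
```

This will break as soon as the markup changes (different attribute order, extra whitespace inside the tag, nested lists), which is why the parser approach above is the safer one.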
