Question

I expected the following regular expression to match, but it does not. Why?

import re
html = '''
                <a href="#">
                    <img src="logo.png" alt="logo" width="100%">
                    </img>
                 </a>
  '''
m = re.match( r'.*logo.*', html, re.M|re.I)

if m: 
    print m.group(1)
if not m:
    print "not found"

Answer 1

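We do not use regular expressions to parse HTML.

Repeat after me: we do NOT use REGEX to PARSE HTML.

That said, it fails because re.match only matches at the very start of the string (re.M does not change that), and your string starts with a newline, which . does not match by default. Use re.search or re.findall instead.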
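
A minimal sketch of the difference, using a trimmed copy of the HTML from the question (Python 3 syntax is assumed here). The pattern has no capture group, so group(0) is used; group(1), as in the question, would raise IndexError even on a successful match.

```python
import re

html = '''
    <a href="#">
        <img src="logo.png" alt="logo" width="100%">
    </a>
'''

pattern = r'.*logo.*'

# re.match only tries position 0 of the string, even with re.M.
# The string starts with "\n", which "." cannot match, so ".*" consumes
# nothing and "logo" is never found.
matched = re.match(pattern, html, re.M | re.I)
print(matched)

# re.search tries every position, so it finds the line containing "logo".
searched = re.search(pattern, html, re.M | re.I)
print(searched.group(0).strip())
```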

Answer 2

Use re.search. re.match only matches at the beginning of the string.

Answer 3

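You need to include the re.DOTALL (== re.S) flag to allow . to match newline characters (\n).

However, this then returns the entire document whenever "logo" appears anywhere in it, which is not very useful.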
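
To see this concretely, here is a small sketch (Python 3 syntax assumed, HTML trimmed from the question) showing that with re.S the match covers the whole string:

```python
import re

html = '''
    <a href="#">
        <img src="logo.png" alt="logo" width="100%">
    </a>
'''

# With re.S, "." also matches "\n", so ".*logo.*" can start at position 0
# and run to the end of the string.
m = re.match(r'.*logo.*', html, re.S | re.I)
whole_document_matched = m is not None and m.group(0) == html
print(whole_document_matched)
```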

Slightly better:

import re
html = """
    <a href="#">
        <img src="logo.png" alt="logo" width="100%" />
    </a>
"""

match_logo = re.compile(r'<[^<]*logo[^>]*>', flags = re.I | re.S)

for found in match_logo.findall(html):
    print(found)

which returns

<img src="logo.png" alt="logo" width="100%" />

Better still:

from bs4 import BeautifulSoup

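# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
pg = BeautifulSoup(html, "html.parser")
# Find the first <img> tag whose alt attribute is "logo".
print(pg.find("img", {"alt": "logo"}))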
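
If installing BeautifulSoup is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This is only an illustration, not part of the original answer; the class name LogoImageFinder and its found attribute are made up for this example.

```python
from html.parser import HTMLParser

html = """
    <a href="#">
        <img src="logo.png" alt="logo" width="100%" />
    </a>
"""


class LogoImageFinder(HTMLParser):
    """Hypothetical helper: collects attributes of <img> tags with alt="logo"."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("alt") == "logo":
            self.found.append(attr_map)


finder = LogoImageFinder()
finder.feed(html)
print(finder.found)
```

Unlike the regex, this only looks at real img tags and their alt attribute, so the word "logo" appearing in other text or attributes does not produce a hit.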

Why doesn't this regular expression work: r'.*logo.*'
