Question

First of all, I apologize for the poor question title, but I couldn't come up with a better one.

I'm trying to build a small tool in Python to improve my skills. It scrapes data from Imdb.com and outputs the titles and other content filtered from the HTML.

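I'm using this RegEx for my search: <h3 class="findSectionHeader"><a name="tt"><\/a>Titles<\/h3>[\s]{0,3}(.*?)<\/td> <\/tr><\/table>, which should return everything after a>Titles<\/h3> and before <\/tr><\/table>, but I'm doing something wrong. I added [\s]{0,3} because I thought the problem might be a \n or something similar, but that didn't fix it at all.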

Here is the source block:

<div class="findSection">
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
<table class="findList">
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" >
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" />
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" >
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a>
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>

Answer 1

Try the following regular expression:

(?s)(?<=<\/h3>\n).*?(?=</tr></table>)

See the regex demo / explanation.

Python

import re

regex = r"(?s)(?<=<\/h3>\n).*?(?=</tr></table>)"
# Sample HTML from the question (named test_str to avoid shadowing the built-in str)
test_str = """<div class="findSection">
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
<table class="findList">
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" >
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" />
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" >
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a>
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>"""

matches = re.finditer(regex, test_str)
# Number the matches starting from 1
for match_num, match in enumerate(matches, start=1):
    print("Match {match_num} was found at {start}-{end}: {match}".format(match_num=match_num, start=match.start(), end=match.end(), match=match.group()))
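
As a rough illustration of what this pattern relies on, the sketch below uses a shortened, hypothetical sample (not the real IMDb markup) to show two things: the inline (?s) flag behaves like passing re.DOTALL, and the lookbehind/lookahead only assert context, so the delimiters are not part of the match. The names sample, inline, and flagged are made up for this example.

```python
import re

# Hypothetical, shortened sample; the real page has much more markup.
sample = "<h3>Titles</h3>\n<table>\n<tr><td>Luther</td></tr></table>"

# Inline (?s) flag: '.' also matches newline characters.
inline = re.search(r"(?s)(?<=</h3>\n).*?(?=</tr></table>)", sample)

# Same pattern without (?s), with the flag passed as an argument instead.
flagged = re.search(r"(?<=</h3>\n).*?(?=</tr></table>)", sample, re.DOTALL)

# The lookarounds assert what surrounds the match without consuming it,
# so neither "</h3>\n" nor "</tr></table>" appears in the result.
print(repr(inline.group()))
print(inline.group() == flagged.group())
```

Both searches should return the same text, starting at <table> and stopping just before </tr></table>.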

Answer 2

You can add the re.DOTALL flag to the re call so that . also matches newlines:

src = '''<div class="findSection">
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>
<table class="findList">
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" >
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" />
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" >
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a>
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>'''

expr = r'<h3 class="findSectionHeader"><a name="tt"><\/a>Titles<\/h3>[\s]{0,3}(.*?)<\/td> <\/tr><\/table>'

import re

print(re.findall(expr, src, re.DOTALL))

Which yields:

['<table class="findList">\n<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" >\n<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" />\n</a> </td> <td class="result_text"> \n<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" >\n<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> \n<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a>\n</td> <td class="result_text"> \n<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) ']
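
To make the effect of the flag easier to see, here is a minimal sketch on a shortened, hypothetical sample (the names sample, pattern, without_flag, and with_flag are made up for this example). Without re.DOTALL, the . in (.*?) stops at the first newline, so the pattern cannot span several lines and nothing is found; with the flag, the same pattern matches.

```python
import re

# Hypothetical, shortened sample; the real page has much more markup.
sample = "<h3>Titles</h3>\n<table>\n<tr><td>Luther</td> </tr></table>"
pattern = r"<h3>Titles</h3>\s{0,3}(.*?)</td> </tr></table>"

# Default behaviour: '.' does not match '\n', so the group cannot
# cross the line break after <table> and the search fails.
without_flag = re.findall(pattern, sample)

# With re.DOTALL, '.' matches any character, including '\n'.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)
print(with_flag)
```

This suggests the [\s]{0,3} in the original expression was not the issue: it only covers the whitespace directly after </h3>, while the captured section itself still contains line breaks.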

Regex: what is wrong with this RegEx?
