Question

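Can anyone help me understand why this regex isn't matching <td>\n and so on? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so that it only says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?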

>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>> 
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n                  myfile.doc\n                </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n                  myfile.doc\n                </td>]

Answer 1

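It's hard to tell without seeing repr(filename), but I think your problem is that a real newline character is being confused with an escaped one (a backslash followed by the letter n).

Compare and contrast the following examples: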

>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 =  '[<td>\n                  myfile.doc\n                </td>]'
>>> filename2 = r'[<td>\n                  myfile.doc\n                </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n                  myfile.doc\\n                </td>]'
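If filename really does contain the two-character sequence backslash plus n (an assumption; only repr(filename) would confirm it), one possible fix is to match that sequence literally. The sketch below uses a raw-string pattern in which \\n means "a backslash followed by n". Note that the non-raw "(<td>)?\\n\s+(</td>)?" from the question reaches the regex engine as \n, so it still looks for a real newline. The names literal_pattern and cleaned are just illustrative.

```python
import re

# Assumption: this mimics what str() of the result list holds, i.e. a
# backslash followed by "n" rather than a real newline character.
filename = r'[<td>\n                  myfile.doc\n                </td>]'

# In a raw string, \\n matches a literal backslash followed by "n".
literal_pattern = r"(<td>)?\\n\s+(</td>)?"

cleaned = re.sub(literal_pattern, "", filename)
print(cleaned)              # [myfile.doc]
print(cleaned.strip("[]"))  # myfile.doc
```

Calling strip("[]") afterwards removes the list brackets that came from converting the result list to a string.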

Answer 2

If your goal is just to get the whitespace-stripped string out of the <td> tag, you can let BeautifulSoup do it for you through the tag's stripped_strings attribute:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
# Find the first <td> whose text contains something that looks like a file extension
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))
# stripped_strings is a generator, so join its pieces into a single string
filename_string = " ".join(filename_tag.stripped_strings)
print(filename_string)

If you want to extract more strings from tags of the same type, you can call findNext on the current tag to get the next matching td tag after it:

# Find the next matching <td> after the current one
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))
filename_string = " ".join(filename_tag.stripped_strings)
print(filename_string)

Then loop, repeating this until findNext returns None.
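As a rough sketch of that loop, the version below swaps the repeated findNext calls for a single find_all, which returns every matching tag at once. The inline HTML is a made-up stand-in for message_tracking.html, and the names html, extension and filenames are only illustrative. It uses string=, the newer spelling of the text= argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of message_tracking.html
html = """
<table>
  <tr>
    <td>
      myfile.doc
    </td>
    <td>not a file</td>
    <td>
      report.pdf
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
extension = re.compile(r"\.[a-z]{3,}")

# Collect the stripped text of every <td> whose string looks like a filename
filenames = [
    " ".join(tag.stripped_strings)
    for tag in soup.find_all("td", string=extension)
]
print(filenames)  # ['myfile.doc', 'report.pdf']
```

Working with the tags directly avoids converting the result list to a string, so no regex cleanup is needed afterwards.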

