Regex doesn't match when it seems it should

Date: 2016-03-29 17:42:42

Tags: python regex

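Any help as to why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so that it only says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?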

>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>> 
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n                  myfile.doc\n                </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n                  myfile.doc\n                </td>]

2 answers:

Answer 0 (score: 3)

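It's hard to tell without seeing repr(filename), but I think your problem is that real newline characters are being confused with escaped newlines (a literal backslash followed by n).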

Compare and contrast the following examples:

>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 =  '[<td>\n                  myfile.doc\n                </td>]'
>>> filename2 = r'[<td>\n                  myfile.doc\n                </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n                  myfile.doc\\n                </td>]'
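If repr(filename) does show doubled backslashes, the string holds a literal backslash followed by n, and the pattern has to match those two characters instead of a newline. A minimal sketch under that assumption (the filename value here is a hypothetical stand-in for the asker's string, and cleaned is an illustrative name):

```python
import re

# Hypothetical stand-in for the asker's string: the raw literal keeps
# a backslash followed by "n" rather than a real newline character.
filename = r'[<td>\n                  myfile.doc\n                </td>]'

# Doubled backslashes in the repr indicate literal backslash characters.
print(repr(filename))

# In a raw string, \\n is the regex for a literal backslash followed by "n".
pattern = r"(<td>)?\\n\s+(</td>)?"
cleaned = re.sub(pattern, "", filename)
print(cleaned)
```

Note that "(<td>)?\\n\s+(</td>)?" written without the r prefix is not equivalent: Python collapses \\n to a backslash plus n, which the regex engine then reads as a newline again.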

Answer 1 (score: 0)

If your goal is just to get the whitespace-stripped string out of the <td> tag, you can let BeautifulSoup do it for you through the tag's stripped_strings attribute:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
# Find the first <td> whose text looks like a filename with an extension
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))
# stripped_strings is a generator, so join its pieces into a single string
filename_string = " ".join(filename_tag.stripped_strings)
print(filename_string)

If you want to extract more strings from tags of the same type, you can use findNext to get the next td tag after the current one:

# Call findNext on the current tag (not on soup) to get the next matching <td> after it
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))
filename_string = " ".join(filename_tag.stripped_strings)
print(filename_string)

Then loop...
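One way to sketch that loop is to let find_all collect every matching td at once instead of calling findNext repeatedly. This is only an illustration: the inline HTML is a hypothetical stand-in for message_tracking.html, filenames is an illustrative name, and it assumes a bs4 version that accepts the string= keyword (the newer spelling of text=).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for message_tracking.html.
html = """
<table>
  <tr><td>
        myfile.doc
  </td></tr>
  <tr><td>no extension here</td></tr>
  <tr><td>
        report.pdf
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every <td> whose string matches the pattern;
# get_text(strip=True) trims the surrounding newlines and spaces.
filenames = [
    td.get_text(strip=True)
    for td in soup.find_all("td", string=re.compile(r"\.[a-z]{3,}"))
]
print(filenames)
```

Working on the tags directly avoids converting the result set to a string, so no regex cleanup of the markup is needed.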