Question

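I have been playing with BeautifulSoup, which is great. My end goal is to get just the text from a page. I only want the text from the body, with a special case to pick up the title and/or alt attributes from <a> or <img> tags.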

So far I have this EDITED & UPDATED CURRENT CODE:

from BeautifulSoup import BeautifulSoup, Comment

# page holds the raw HTML of the page as a string
soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page

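1) What would you suggest as the best way, for my special case, to NOT exclude those attributes from the two tags I listed above? If this is too complex to do, it isn't as important as doing #2.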

2) I would like to strip <!-- --> tags and everything in between them. How would I go about that?

QUESTION EDIT @jathanism: Here are some comment tags that I have tried to strip, but they remain, even when I use your example:

<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->

Answer 1

Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                        <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
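The snippet above targets BeautifulSoup 3 on Python 2. For anyone on Python 3 with the bs4 package, a rough equivalent might look like the sketch below. It assumes bs4 4.4 or later (for the `string=` argument) and the built-in "html.parser" backend; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Made-up sample markup containing two comments
html = "<p>1<!--The loneliest number--> <a>2<!--Can be as bad as one--><b>3</b></a></p>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so removing nodes while looping over it is safe
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

# Collapse runs of whitespace, as in the question's code
text = " ".join(soup.get_text().split())
print(text)
```

The comments are removed from the tree itself, so printing the soup afterwards no longer shows them either.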

Answer 2

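"I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those backslashes cause certain tags to be overlooked."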

This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it with a markupMassage regex, straight from the docs:

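import re, copy
from BeautifulSoup import BeautifulSoup

# Example input with a malformed comment opener ("<!-" instead of "<!--"),
# inferred from the output shown below
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"

# Rewrite "<!-x" as "<!--x" before the parser sees the markup
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)

BeautifulSoup(badString, markupMassage=myNewMassage)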
# Foo<!--This comment is malformed.-->Bar<br />Baz
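To see what that regex does on its own, here is a small standalone sketch that applies the same pattern with the standard `re` module before any parsing happens. The helper name `fix_comment_openers` and the sample strings are made up for illustration. This kind of pre-processing is also an option with bs4, whose constructor no longer accepts markupMassage.

```python
import re

# Same pattern as above: "<!-" followed by any character other than a second "-"
malformed_opener = re.compile(r"<!-([^-])")

def fix_comment_openers(markup):
    """Rewrite '<!-x' as '<!--x' so a parser sees a well-formed comment opener."""
    return malformed_opener.sub(lambda m: "<!--" + m.group(1), markup)

bad = "Foo<!-This comment is malformed.-->Bar<br/>Baz"
good = "Foo<!--Already fine.-->Bar"

print(fix_comment_openers(bad))   # Foo<!--This comment is malformed.-->Bar<br/>Baz
print(fix_comment_openers(good))  # Foo<!--Already fine.-->Bar
```

Well-formed comments are left alone because the character after "<!-" is a "-", which the `[^-]` class refuses to match.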

Answer 3

If you are looking for a solution in BeautifulSoup version 3 (see the BS3 Docs on Comment):

import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
# Find the comment by a word it contains ("nice"), then borrow its class
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()

Answer 4

If you would rather not mutate the tree, you can filter the comments out instead:

[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
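Building on that idea, the sketch below walks the tree once without modifying it, skips comments, and also picks up the title and alt attributes from <a> and <img> tags that the question asks about. It assumes Python 3 with bs4 and the "html.parser" backend; the helper name `text_with_attributes` and the sample page are made up. It also skips text inside <script> and <style>, on the assumption that the markers shown in the question sit inside <script> elements (their JavaScript content suggests so), where the parser treats them as script text rather than as Comment nodes.

```python
from bs4 import BeautifulSoup, Comment, Tag

def text_with_attributes(soup):
    """Collect visible text plus title/alt values, leaving the soup untouched."""
    parts = []
    for node in soup.descendants:  # document order, tags and strings alike
        if isinstance(node, Tag):
            if node.name in ("a", "img"):
                for attr in ("title", "alt"):
                    value = node.get(attr)
                    if value:
                        parts.append(value)
        elif isinstance(node, Comment) or node.parent.name in ("script", "style"):
            continue  # skip HTML comments and script/style contents
        else:
            parts.append(str(node))
    # Collapse runs of whitespace, as in the question's code
    return " ".join(" ".join(parts).split())

# Made-up sample page
html = """<body>
<script><!-- var x = 1; //--></script>
<p>Hello <!-- hidden --> world</p>
<a href="/home" title="Home page">home</a>
<img src="logo.png" alt="Company logo">
</body>"""

soup = BeautifulSoup(html, "html.parser")
result = text_with_attributes(soup)
print(result)
```

Attribute values are emitted at the position of their tag, so a link's title lands just before its link text. If the attributes should go elsewhere, collect them in a separate list instead.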

How can I strip comment tags from HTML using BeautifulSoup?
