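I'm playing with bs4 and trying to scrape the following site: https://pythonbasics.org/selenium-get-html/. I want to remove all script tags from the HTML.
To remove the script tags, I used code like the following: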
for script in soup("script"):
    script.decompose()
or
[s.extract() for s in soup.findAll('script')]
as well as many other variants I found online. They all serve the same purpose, but they fail to remove script tags such as:
<script src="/lib/jquery.js"></script>
<script src="/lib/waves.js"></script>
<script src="/lib/jquery-ui.js"></script>
<script src="/lib/jquery.tocify.js"></script>
<script src="/js/main.js"></script>
<script src="/lib/toc.js"></script>
or
<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables
*/
var disqus_config = function () {
this.page.url = 'https://pythonbasics.org/selenium-get-html/'; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = '_posts/selenium-get-html.md'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = '//https-pythonbasics-org.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript" rel="nofollow">comments powered by Disqus.</a></noscript>
What is going on here? I found some related questions:
beautifulsoup remove all the internal javascript
BeatifulSoup4 get_text still has javascript
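But the answers there recommend the same approach I already used to strip these scripts, and it fails. Other people in the comments are stuck in the same way.
I also looked into functions that nltk used to provide, but they no longer seem to work. Any ideas? Why do these functions fail to remove all script tags, and what can we do without regular expressions?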
Answer 0: (score: 1)
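This happens because some of the <script> tags are inside HTML comments (<!-- ... -->). BeautifulSoup parses a comment as a single text node, so a search for script tags never sees what is written inside it.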
You can extract these HTML comments by checking whether a node is of type Comment:
from bs4 import BeautifulSoup, Comment
# `html` is assumed to already hold the page source as a string
soup = BeautifulSoup(html, "html.parser")
# Find all comments on the page and remove them; most of them contain `script` tags
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()
# Find all remaining `script` tags and remove them
for tag in soup.find_all("script"):
    tag.extract()
print(soup.prettify())
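To see the effect in isolation, here is a minimal self-contained sketch. The sample markup and the two helper names (strip_script_tags_only, strip_scripts) are made up for illustration and are not taken from the page in the question. It compares removing only script tags with removing comments first and then script tags:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: one live <script> and one inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<p>Hello</p>
<script src="/js/main.js"></script>
<!-- <script src="/lib/jquery.js"></script> -->
</body></html>
"""


def strip_script_tags_only(html):
    """Remove <script> elements only; commented-out markup is left alone."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script"):
        tag.decompose()
    return str(soup)


def strip_scripts(html):
    """Remove every HTML comment, then every <script> element."""
    soup = BeautifulSoup(html, "html.parser")
    # A comment is a text node, not a tag, so find_all("script") skips it.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    for tag in soup.find_all("script"):
        tag.decompose()
    return str(soup)


if __name__ == "__main__":
    print(strip_script_tags_only(SAMPLE_HTML))  # commented-out script survives
    print(strip_scripts(SAMPLE_HTML))           # no script markup remains
```

Note that this drops every comment, not just those containing scripts. If some comments need to be kept, one option is to extract only the comments whose text contains "<script".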