Question

I'm playing around with bs4 and trying to scrape the following website: https://pythonbasics.org/selenium-get-html/. I want to remove all script tags from the HTML.

To remove the script tags, I used code such as:

for script in soup("script"):
    script.decompose()

or

[s.extract() for s in soup.findAll('script')]

as well as many other variants I found online. They all serve the same purpose, but they fail to remove script tags such as:

<script src="/lib/jquery.js"></script>
<script src="/lib/waves.js"></script>
<script src="/lib/jquery-ui.js"></script>
<script src="/lib/jquery.tocify.js"></script>

<script src="/js/main.js"></script>
<script src="/lib/toc.js"></script>

or

<div id="disqus_thread"></div>
    <script>
        /**
         *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
         *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables
         */
        
        var disqus_config = function () {
            this.page.url = 'https://pythonbasics.org/selenium-get-html/';  // Replace PAGE_URL with your page's canonical URL variable
            this.page.identifier = '_posts/selenium-get-html.md'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
        };
        
        (function() {  // DON'T EDIT BELOW THIS LINE
            var d = document, s = d.createElement('script');
            
            s.src = '//https-pythonbasics-org.disqus.com/embed.js';
            
            s.setAttribute('data-timestamp', +new Date());
            (d.head || d.body).appendChild(s);
        })();
    </script>
    <noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript" rel="nofollow">comments powered by Disqus.</a></noscript>

What is going on here? I found some related questions:

beautifulsoup remove all the internal javascript

BeatifulSoup4 get_text still has javascript

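But the answers recommend the same approaches I already used to strip these scripts, and those failed. Other people in the comments are stuck just like me.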

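I also looked up the functions nltk used to provide for this, but they no longer seem to work. Do you have any ideas? Why do these functions fail to remove all script tags, and what can we do without resorting to regex?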

Answer 1

This happens because some of the <script> tags are inside HTML comments (<!-- ... -->).
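
To see the cause in isolation, here is a minimal sketch. The `sample_html` string and its file names are made up for illustration and are not the actual page source. It suggests that a `<script>` inside a comment is parsed as a single `Comment` text node rather than as a tag, so a search by tag name never reaches it.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical snippet: one live script tag and one commented-out script tag.
sample_html = """
<body>
<script src="/js/live.js"></script>
<!-- <script src="/lib/jquery.js"></script> -->
<p>Hello</p>
</body>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Searching by tag name only finds real elements in the tree.
script_tags = soup.find_all("script")

# The commented-out markup is stored as raw text in a Comment node.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(len(script_tags))   # 1: only the live tag is an element
print(type(comments[0]))  # the commented-out script is a Comment, not a Tag
```

Removing the live tag with `decompose()` or `extract()` therefore leaves the commented-out markup untouched, which is why it still shows up in the output.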

You can extract these HTML comments by checking whether a node is of type Comment:

from bs4 import BeautifulSoup, Comment

# `html` holds the page source that was fetched earlier
soup = BeautifulSoup(html, "html.parser")

# Find all comments on the page and remove them; most of them contain `script` tags
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    comment.extract()

# Find all remaining `script` tags and remove them
for tag in soup.find_all("script"):
    tag.extract()

print(soup.prettify())
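
For a version that runs without fetching the page, the same two steps can be wrapped in a small helper. This is only a sketch: the function name `strip_scripts` and the `sample_html` input are hypothetical, and it assumes you are happy to drop every comment, not only those that contain scripts.

```python
from bs4 import BeautifulSoup, Comment


def strip_scripts(html: str) -> str:
    """Return html with HTML comments and <script> elements removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Comments are text nodes, so match them by type rather than by tag name.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Remove the remaining real <script> elements together with their contents.
    for tag in soup.find_all("script"):
        tag.decompose()

    return str(soup)


# Hypothetical input for illustration only.
sample_html = (
    '<div id="disqus_thread"></div>'
    "<script>var d = document;</script>"
    '<!-- <script src="/lib/toc.js"></script> -->'
    "<p>kept</p>"
)

cleaned = strip_scripts(sample_html)
print(cleaned)
```

If some comments need to survive, one option is to extract only those whose text contains `<script`, at the cost of a slightly more specific filter.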

Beautiful Soup fails to remove all script tags
