Question

I want to remove the content between <script></script> tags. I check for the pattern manually and iterate with a while loop. However, I get a StringIndexOutOfBoundsException on this line:

String script=source.substring(startIndex,endIndex-startIndex);

Here is the complete method:

public static String getHtmlWithoutScript(String source){
        String START_PATTERN = "<script>";
        String END_PATTERN = " </script>";
        while(source.contains(START_PATTERN)){
            int startIndex=source.lastIndexOf(START_PATTERN);
            int endIndex=source.indexOf(END_PATTERN,startIndex);

           String script=source.substring(startIndex,endIndex);
           source.replace(script,"");
        }
        return source;
    }

Am I doing something wrong here? I am also getting endIndex=-1. Can anyone help me identify why my code breaks?

Thanks in advance.

Answer 1

    String text = "<script>This is dummy text to remove </script> dont remove this";
    StringBuilder sb = new StringBuilder(text);
    String startTag = "<script>";
    String endTag = "</script>";

    //removing the text between script
    sb.replace(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag), "");

    System.out.println(sb.toString());

If you also want to remove the script tags themselves, add the following line:

String withoutTags = sb.toString().replace(startTag, "").replace(endTag, "");

Update:

If you don't want to use StringBuilder, you can do this:

    String text = "<script>This is dummy text to remove </script> dont remove this";
    String startTag = "<script>";
    String endTag = "</script>";

    //removing the text between script
    String textToRemove = text.substring(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag));
    text = text.replace(textToRemove, "");

    System.out.println(text);
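
As for why the method in the question breaks: END_PATTERN is " </script>" with a leading space, so indexOf returns -1 whenever the closing tag is not preceded by a space, and substring then throws. In addition, String is immutable, so the result of source.replace(...) is discarded and the loop would never make progress. The following is only a minimal sketch of a corrected loop, assuming plain <script> tags without attributes and assuming the tags should be removed together with their content; the class name ScriptStripper is made up for illustration.

```java
public class ScriptStripper {
    private static final String START_TAG = "<script>";
    private static final String END_TAG = "</script>";

    // Removes every <script>...</script> block, tags included.
    public static String getHtmlWithoutScript(String source) {
        StringBuilder sb = new StringBuilder(source);
        int start = sb.indexOf(START_TAG);
        while (start != -1) {
            int end = sb.indexOf(END_TAG, start + START_TAG.length());
            if (end == -1) {
                break; // unterminated tag: leave the remainder untouched
            }
            // Delete from the opening tag through the end of the closing tag.
            sb.delete(start, end + END_TAG.length());
            // Continue searching from the position where the block was removed.
            start = sb.indexOf(START_TAG, start);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "a<script>x</script>b<script>y</script>c";
        System.out.println(getHtmlWithoutScript(html)); // prints "abc"
    }
}
```

Working on a StringBuilder avoids creating a new String on every pass, and the explicit -1 check keeps malformed input from throwing.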

Answer 2

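You can use a regular expression to remove the script tag content:

public String removeScriptContent(String html) {
    if (html == null) {
        return null;
    }
    // Lazy .*? stops at the first </script>; DOTALL lets '.' match line breaks.
    Pattern pattern = Pattern.compile("(<script>).*?(</script>)", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(html);
    // Keep both tags and drop only what lies between them, for every script block.
    // Input without any script block is returned unchanged.
    return matcher.replaceAll("$1$2");
}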

You must add these two imports:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Answer 3

I know I'm late to the party, but I'd like to offer a regex solution (one that has actually been tested).

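The thing to note here is that regex engines are greedy by default. A search pattern such as <script>(.*)</script> therefore matches from the first <script> through the last </script> it can reach, on the same line or in the whole file depending on the regex options in use, so anything sitting between two separate script blocks is swallowed as well.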

To match exactly what you are after, you can use a "lazy" search.

Lazy search pattern: <script>(.*?)<\/script>

Now you will get the exact result.

You can read more about lazy and greedy regex matching in this answer.
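
To make the difference concrete, here is a small Java sketch that runs both patterns over the same input. The class name GreedyVsLazy and the helper strip are made up for illustration, and the use of Pattern.DOTALL is my assumption so that script bodies spanning several lines are covered too. The \/ escape from the pattern above is only needed in regex literals that use / as a delimiter, so the Java strings below use a plain slash.

```java
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Removes every match of the given regex from the input.
    static String strip(String html, String regex) {
        // DOTALL so that '.' also matches line terminators inside a script body.
        return Pattern.compile(regex, Pattern.DOTALL).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<script>a()</script>keep<script>b()</script>";
        // Greedy: one match from the first <script> to the last </script>.
        System.out.println(strip(html, "<script>(.*)</script>"));  // prints an empty line
        // Lazy: each script block is matched separately.
        System.out.println(strip(html, "<script>(.*?)</script>")); // prints "keep"
    }
}
```

With the greedy pattern the text between the two script blocks is lost; the lazy pattern leaves it in place.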
