Using a regex and rebuilding the original string

Time: 2016-05-13 20:50:56

Tags: java regex

I have text like this:

This is a test text. <span> with bold </span> and with <span> italic </span> and so on and so forth.

Right now I use the regex <[^>]*> to match all HTML tags, then replace every match with an empty string. The result looks like this:

This is a test text. with bold and with italic and so on and so forth.
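
The stripping step described above can be sketched as follows. This is a minimal illustration, not the asker's actual code; the class name `TagStripper` and the method `stripTags` are made up for the example.

```java
import java.util.regex.Pattern;

class TagStripper {
  // Matches anything that looks like a tag: '<', any run of non-'>' characters, then '>'.
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  // Returns the input with every tag-like match removed.
  static String stripTags(String html) {
    return TAG.matcher(html).replaceAll("");
  }

  public static void main(String[] args) {
    String html = "This is a test text. <span> with bold </span> and with "
        + "<span> italic </span> and so on and so forth.";
    // Only the tags are removed; the spaces on either side of each tag stay,
    // so the printed text contains some double spaces.
    System.out.println(stripTags(html));
  }
}
```

Note that this pattern is only a rough approximation of HTML: it will, for example, also treat a literal `<` followed later by `>` in ordinary text as a tag.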

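In the text above, I want to find a piece of text, for example "italic", insert a special tag around it, and then rebuild the original text. So the result would be: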

This is a test text. <span> with bold </span> and with <span> <span class='special'>italic</span> </span> and so on and so forth.

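I am writing code that uses matcher.start() and matcher.end() to build a list of all the HTML tags, and I am thinking of rebuilding the string from that list. Is there a better way? How would you solve it?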
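
As a rough sketch of the bookkeeping described above, the tag positions can be collected with `Matcher.start()` and `Matcher.end()`. The class and method names here (`TagCollector`, `tagSpans`) are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCollector {
  // Returns one {start, end} pair per tag-like match, as offsets into the
  // original string. The end offset is exclusive, as with Matcher.end().
  static List<int[]> tagSpans(String html) {
    List<int[]> spans = new ArrayList<>();
    Matcher m = Pattern.compile("<[^>]*>").matcher(html);
    while (m.find()) {
      spans.add(new int[] {m.start(), m.end()});
    }
    return spans;
  }

  public static void main(String[] args) {
    String html = "ab<i>c</i>";
    for (int[] span : tagSpans(html)) {
      System.out.println(span[0] + ".." + span[1] + " " + html.substring(span[0], span[1]));
    }
  }
}
```

These offsets refer to the original string. To rebuild after searching the stripped text, they still have to be translated into offsets in the tag-free text, or the other way around.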

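Edit

The reason I search the text after removing the HTML is that the HTML can get in the way of the text I am looking for. For example, it could look like this: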

This is a test text. <span> with bold </span> and with <span> it</span>al<span>ic </span> and so on and so forth.
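
To make the problem concrete, the small check below compares a plain substring search on the raw markup with the same search on the tag-free text. The class `SplitWordDemo` and its methods are illustrative names, not part of the original post.

```java
class SplitWordDemo {
  // Searches the markup as-is, so tags inside the word break the match.
  static boolean foundInRaw(String html, String word) {
    return html.contains(word);
  }

  // Removes tag-like substrings first, then searches the remaining text.
  static boolean foundInStripped(String html, String word) {
    return html.replaceAll("<[^>]*>", "").contains(word);
  }

  public static void main(String[] args) {
    String html = "This is a test text. <span> with bold </span> and with "
        + "<span> it</span>al<span>ic </span> and so on and so forth.";
    System.out.println(foundInRaw(html, "italic"));      // false
    System.out.println(foundInStripped(html, "italic")); // true
  }
}
```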

EDIT2

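This is not a duplicate question, as has been suggested. Imagine a scenario where you need to highlight text in HTML as you see it on screen, by doing nothing more than adding a simple span with a yellow background color around the text you selected. Now suppose that text is "italic", but in the markup it appears as <span>ita</span>l<span>ic</span>. My question is: how do you find this word and then add a span around it?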

EDIT3 Final edit to simplify the problem statement. I hope this makes it clear. This is the input:

This is a test text with <span>it<span>al<span>ic</span> and etc.

This is the expected output:

This is a test text with <span class='highlight'><span>it<span>al<span>ic</span></span> and etc.
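
One possible way to get close to this output, sketched here under assumptions, is to keep an index map from each character of the tag-free text back to its position in the original markup, search the tag-free text, and insert the wrapper at the mapped positions. The names `OffsetMapHighlighter` and `highlightFirst` are hypothetical, the search word is treated as a literal string rather than a regex, and only the first occurrence is wrapped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetMapHighlighter {
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  static String highlightFirst(String html, String word, String open, String close) {
    StringBuilder text = new StringBuilder();  // markup with tags removed
    int[] origIndex = new int[html.length()];  // text index -> index in html
    Matcher tags = TAG.matcher(html);
    int cursor = 0;
    while (tags.find()) {
      // Copy the characters between the previous tag and this one.
      for (int i = cursor; i < tags.start(); i++) {
        origIndex[text.length()] = i;
        text.append(html.charAt(i));
      }
      cursor = tags.end();
    }
    // Copy whatever follows the last tag.
    for (int i = cursor; i < html.length(); i++) {
      origIndex[text.length()] = i;
      text.append(html.charAt(i));
    }

    int start = text.indexOf(word);
    if (start < 0 || word.isEmpty()) {
      return html;  // nothing to highlight
    }
    int from = origIndex[start];                        // first matched character
    int to = origIndex[start + word.length() - 1] + 1;  // just past the last one
    return html.substring(0, from) + open + html.substring(from, to)
        + close + html.substring(to);
  }

  public static void main(String[] args) {
    System.out.println(highlightFirst(
        "This is a test text with <span>it<span>al<span>ic</span> and etc.",
        "italic", "<span class='highlight'>", "</span>"));
  }
}
```

With the input above, this sketch opens the wrapper immediately before the first matched character, so the highlight span starts after the first `<span>` rather than before it as in the expected output. Like any offset-based insertion, it makes no attempt to keep the resulting tags properly nested.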

1 Answer:

Answer 0: (score: 1)

This does what you are looking for, but it does not detect or prevent the generation of malformed HTML.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


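/**
 * Strips tags from an input string, remembers where each tag was, and can
 * re-insert them together with extra tags wrapped around matched text.
 */
public class HtmlHighlighter {
  private final String inputWithoutTags;
  private final List<Tag> tags;

  /** A tag's literal text and its insertion offset within the tag-free text. */
  private static class Tag {
    private final String text;
    private final int startPos;

    private Tag(final String text, final int startPos) {
      this.text = text;
      this.startPos = startPos;
    }
  }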

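  public HtmlHighlighter(final String input, final String tagRegex) {
    final Pattern p = Pattern.compile(tagRegex);
    tags = new ArrayList<>();
    final Matcher m = p.matcher(input);
    final StringBuffer sb = new StringBuffer();
    int cursor = 0;              // position in the original input, just past the previous tag
    int cursorExcludingTags = 0; // the same position, measured in the tag-free text
    while (m.find()) {
      // Advance by the length of the plain text between the previous tag and this one.
      cursorExcludingTags += m.start() - cursor;
      tags.add(new Tag(input.substring(m.start(), m.end()), cursorExcludingTags));
      cursor = m.end();
      // Copy the text before the tag and drop the tag itself.
      m.appendReplacement(sb, "");
    }
    m.appendTail(sb);
    inputWithoutTags = sb.toString();
  }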

  public String highlightText(String regexToFind, String openingTag, String closingTag) {
    final List<Tag> allTags = getAllTags(regexToFind, openingTag, closingTag);
    return combineTags(allTags);
  }

  private List<Tag> getAllTags(final String regexToFind, final String openingTag, final String closingTag) {
    final List<Tag> ret = new ArrayList<>(tags);
    final Pattern p = Pattern.compile(regexToFind);
    final Matcher m = p.matcher(inputWithoutTags);
    while (m.find()) {
      addTag(new Tag(openingTag, m.start()), true, ret);
      addTag(new Tag(closingTag, m.end()), false, ret);
    }
    return ret;
  }

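  // Inserts a tag into the offset-ordered list. With beforeIgnored == true the tag
  // goes ahead of existing tags at the same offset (used for opening tags);
  // otherwise it goes after them (used for closing tags).
  private void addTag(final Tag tag, final boolean beforeIgnored, final List<Tag> allTags) {
    for (int i = 0; i < allTags.size(); i++) {
      if (allTags.get(i).startPos >= tag.startPos && beforeIgnored) {
        allTags.add(i, tag);
        return;
      }
      if (allTags.get(i).startPos > tag.startPos) {
        allTags.add(i, tag);
        return;
      }
    }
    allTags.add(allTags.size(), tag);
  }

  // Re-inserts all tags into the tag-free text. Working from the last tag to the
  // first keeps the offsets of the tags not yet inserted valid.
  private String combineTags(final List<Tag> allTags) {
    final StringBuilder sb = new StringBuilder(inputWithoutTags);
    for (int i = allTags.size() - 1; i >= 0; i--) {
      final Tag tag = allTags.get(i);
      sb.insert(tag.startPos, tag.text);
    }
    return sb.toString();
  }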

  public static void main(String... args) {
    final HtmlHighlighter highlighter = new HtmlHighlighter("This is a test text with <span>it<span>al<span>ic</span> and etc.", "\\<.*?\\>");
    System.out.println(highlighter.highlightText("italic", "<span class='highlight'>", "</span>"));
  }
}
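
Since the answer notes that malformed HTML is neither detected nor prevented, a simple nesting check could be run on the result. The sketch below is an assumption-laden illustration, not part of the answer: the class `NestingCheck` and method `isWellNested` are made-up names, and the check ignores void and self-closing elements such as `<br>`, comments, and anything else a real HTML parser would handle.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestingCheck {
  // Group 1 is "/" for a closing tag and empty for an opening tag; group 2 is the tag name.
  private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

  // Returns true if every closing tag matches the most recently opened tag
  // and nothing is left open at the end.
  static boolean isWellNested(String html) {
    Deque<String> open = new ArrayDeque<>();
    Matcher m = TAG.matcher(html);
    while (m.find()) {
      String name = m.group(2).toLowerCase(Locale.ROOT);
      if (m.group(1).isEmpty()) {
        open.push(name);
      } else if (open.isEmpty() || !open.pop().equals(name)) {
        return false;
      }
    }
    return open.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(isWellNested("<b>x <span>ita</span></b>lic"));                   // true
    System.out.println(isWellNested("<b>x <span class='highlight'>ita</b>lic</span>")); // false
  }
}
```

The second call shows the kind of output that offset-based insertion can produce when a match starts inside one element and ends outside it. The sample input in the question has three opening `<span>` tags and one closing tag, so this check would reject it before any highlighting is applied.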