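From wildcards to regular expressions

Date: 2012-12-13 15:06:33

Tags: java regex string filter wildcard

I want to allow two main wildcards, ? and *, for filtering my data.

Here is what I'm doing now (as I've seen on many websites):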

public boolean contains(String data, String filter) {
    if(data == null || data.isEmpty()) {
        return false;
    }
    String regex = filter.replace(".", "[.]")
                         .replace("?", ".")
                         .replace("*", ".*");
    return Pattern.matches(regex, data);
}

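But shouldn't we escape all the other regex special characters, such as | or (, and so on? Also, couldn't we keep ? and * as literal characters when they are preceded by a \? For example, something like: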

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, but ?, * and \
      .replaceAll("([^\\\\]|^)\\?", "$1.")           // 2. replace any ? that isn't preceded by a \ by .
      .replaceAll("([^\\\\]|^)\\*", "$1.*")          // 3. replace any * that isn't preceded by a \ by .*
      .replaceAll("\\\\([^?*]|$)", "\\\\\\\\$1");    // 4. replace any \ that isn't followed by a ? or a * (possibly due to step 2 and 3) by \\

What do you think? If you agree, am I missing any other regex special characters?
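For reference, a minimal sketch of what goes wrong when those characters are left unescaped. It reuses the translation from the contains() method above; the class name NaiveWildcardDemo and the helpers toRegex, matches and compiles are hypothetical names introduced only for this illustration:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class NaiveWildcardDemo {
    // Same translation as in contains(): only ., ? and * are handled.
    public static String toRegex(String filter) {
        return filter.replace(".", "[.]")
                     .replace("?", ".")
                     .replace("*", ".*");
    }

    // Full-match test of data against the translated filter.
    public static boolean matches(String data, String filter) {
        return Pattern.matches(toRegex(filter), data);
    }

    // Reports whether the translated filter is a valid regex at all.
    public static boolean compiles(String filter) {
        try {
            Pattern.compile(toRegex(filter));
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("a", "a|b"));   // true: | acts as alternation
        System.out.println(matches("a|b", "a|b")); // false: the literal text is not matched
        System.out.println(compiles("a(b"));       // false: unclosed group
    }
}
```

So an unescaped | silently changes the meaning of the filter, while an unescaped ( makes the pattern fail to compile.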


Edit #1 (after taking dan1111's and m.buettner's suggestions into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars, but \, ? and *
regex = regex.replaceAll("([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

How about this?


Edit #2 (after taking dan1111's suggestion into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars (if not already escaped by user), but \, ? and *
regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

Am I getting close to the goal?

3 answers:

Answer 0 (score: 2)

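A note on the backslashes in the replacement string: Java's replaceAll treats a backslash in the replacement as an escape character, so the four backslashes in "\\\\$1" (as written in Java source) are what it takes to emit one literal backslash in front of the match. With only two, "\\$1" would escape the $ and emit the literal text $1 instead.

You can use a negative lookbehind to avoid both the ([^\\\\]|^) group and the $1 in the replacement string:

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, but ?, * and \
      .replaceAll("(?<!\\\\)[?]", ".")               // 2. replace any ? that isn't preceded by a \ by .
      .replaceAll("(?<!\\\\)[*]", ".*")              // 3. replace any * that isn't preceded by a \ by .*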
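A small sketch of what the lookbehind version of steps 2 and 3 does to a filter containing both escaped and unescaped wildcards. LookbehindDemo and translateWildcards are made-up names for illustration only:

```java
public class LookbehindDemo {
    // Steps 2 and 3 only: translate ? and * unless a backslash precedes them.
    public static String translateWildcards(String filter) {
        return filter.replaceAll("(?<!\\\\)[?]", ".")
                     .replaceAll("(?<!\\\\)[*]", ".*");
    }

    public static void main(String[] args) {
        // The filter is a?b\?c* (the backslash is doubled in Java source).
        String regex = translateWildcards("a?b\\?c*");
        System.out.println(regex);                     // a.b\?c.*
        System.out.println("axb?cdef".matches(regex)); // true
        System.out.println("axbycdef".matches(regex)); // false: \? demands a literal ?
    }
}
```

Note that the lookbehind only inspects the single preceding character, so a ? that follows an escaped backslash (\\?) is also left untouched.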

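I don't really see why you need the last step. Wouldn't it escape the backslashes that are escaping your metacharacters, so that they no longer escape anything? (The replacement in that step emits two literal backslashes: the eight written in Java source become four in the string, which replaceAll reads as two escaped backslashes.) Say your original input is th|is. Your first replacement turns it into th\|is. The last replacement then turns that into th\\|is, which matches either th followed by a backslash, or is.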
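The th|is trace can be reproduced with the four replacements from the question. EscapeStepDemo and convert are hypothetical names used only for this sketch:

```java
public class EscapeStepDemo {
    // The four steps exactly as written in the question.
    public static String convert(String filter) {
        return filter
            .replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape special chars
            .replaceAll("([^\\\\]|^)\\?", "$1.")           // 2. unescaped ? becomes .
            .replaceAll("([^\\\\]|^)\\*", "$1.*")          // 3. unescaped * becomes .*
            .replaceAll("\\\\([^?*]|$)", "\\\\\\\\$1");    // 4. double the remaining backslashes
    }

    public static void main(String[] args) {
        String regex = convert("th|is");
        System.out.println(regex);                  // th\\|is
        System.out.println("th|is".matches(regex)); // false: the literal text no longer matches
        System.out.println("is".matches(regex));    // true: second alternative
        System.out.println("th\\".matches(regex));  // true: th followed by a backslash
    }
}
```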

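You need to distinguish between how the string is written in your code (uncompiled, with twice as many backslashes) and what it contains after compilation (only half as many backslashes).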
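Printing a pattern string is a quick way to see the difference. A minimal sketch (BackslashCountDemo and lookbehindPattern are illustrative names, not part of the question's code):

```java
public class BackslashCountDemo {
    // Written with four backslashes in source; the compiled string holds two.
    public static String lookbehindPattern() {
        return "(?<!\\\\)[?]";
    }

    public static void main(String[] args) {
        String pattern = lookbehindPattern();
        System.out.println(pattern);          // (?<!\\)[?]
        System.out.println(pattern.length()); // 10
    }
}
```

The regex engine then reads those two backslashes as one escaped, literal backslash.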

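You may also want to consider limiting the number of * allowed. A regex like .*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*! (run against input that does not contain !) can take a very long time to run. This problem is called catastrophic backtracking.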
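One simple mitigation, sketched here as an assumption rather than a complete fix, is to collapse runs of consecutive unescaped * into a single * before translating the filter. CollapseStarsDemo and collapseStars are hypothetical names; the sketch does not limit stars separated by other characters, which would need an explicit count check:

```java
public class CollapseStarsDemo {
    // Collapse two or more consecutive * (not preceded by a backslash) into one.
    public static String collapseStars(String filter) {
        return filter.replaceAll("(?<!\\\\)[*]{2,}", "*");
    }

    public static void main(String[] args) {
        System.out.println(collapseStars("a****b")); // a*b
        System.out.println(collapseStars("a\\**b")); // a\**b (escaped star plus one wildcard, unchanged)
    }
}
```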

Answer 1 (score: 0)

The solution I finally adopted (using the Apache Commons Lang library):

public static boolean isFiltered(String data, String filter) {
    // no filter: return true
    if (StringUtils.isBlank(filter)) {
        return true;
    }
    // a filter but no data: return false
    else if (StringUtils.isBlank(data)) {
        return false;
    }
    // a filter and a data:
    else {
        // case insensitive
        data = data.toLowerCase();
        filter = filter.toLowerCase();
        // .matches() auto-anchors, so add [*] (i.e. "containing")
        String regex = "*" + filter + "*";
        // replace any pair of backslashes by [*]
        regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
        // minimize unescaped redundant wildcards
        regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
        // escape unescaped regexps special chars, but [\], [?] and [*]
        regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
        // replace unescaped [?] by [.]
        regex = regex.replaceAll("(?<!\\\\)[?]", ".");
        // replace unescaped [*] by [.*]
        regex = regex.replaceAll("(?<!\\\\)[*]", ".*");
        // return whether data matches regex or not
        return data.matches(regex);
    }
}

Many thanks to @dan1111 and @m.buettner for their valuable help ;)

Answer 2 (score: 0)

Try this simpler version:

String regex = Pattern.quote(filter).replace("*", "\\E.*\\Q").replace("?", "\\E.\\Q");

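This quotes the whole filter with \Q and \E, then stops the quoting at each * and ?, replacing them with the equivalent patterns (.* for * and . for ?).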

I tested it with:

String simplePattern = "ab*g\\Ei\\.lmn?p";
String data = "abcdefg\\Ei\\.lmnop";
String quotedPattern = Pattern.quote(simplePattern);
System.out.println(quotedPattern);
String regex = quotedPattern.replace("*", "\\E.*\\Q").replace("?", "\\E.\\Q");
System.out.println(regex);
System.out.println(data.matches(regex));

Output:

\Qab*g\E\\E\Qi\.lmn?p\E
\Qab\E.*\Qg\E\\E\Qi\.lmn\E.\Qp\E
true

Note that this is based on Oracle's implementation; I don't know whether other valid implementations exist.