Question

I'm trying to match Unicode characters in Java.

Input string: informa

String to match: informátion

So far I've tried this:

Pattern p = Pattern.compile("informa[\u0000-\uffff].*", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
String s = "informátion";
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println("Match!");
} else {
    System.out.println("No match");
}

It prints "No match". Any ideas?

Answer 1

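The term "Unicode character" is not specific enough. It covers every character in the Unicode range, so it also covers the "normal" characters. However, the term is often used when one actually means "a character that is not in the printable ASCII range".

For that, use the regex [^\x20-\x7E].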

boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
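
A variant of the same check, as a rough sketch: it uses find() instead of wrapping the character class in .*, because . does not match line terminators by default, so the matches() form can miss strings that contain several line breaks. The class and method names here are made up for illustration.

```java
import java.util.regex.Pattern;

class NonPrintableAsciiCheck {
    // Anything outside the printable ASCII range 0x20-0x7E.
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    // Hypothetical helper: true if at least one character falls outside that range.
    static boolean containsNonPrintableAscii(String s) {
        return NON_PRINTABLE_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("information"));
        System.out.println(containsNonPrintableAscii("inform\u00e1tion"));
    }
}
```

Note that this only tells you whether such a character is present. It does not help with treating á and a as the same letter.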


Answer 2

Is it because informa isn't a substring of informátion at all?

How does your code behave if you remove the last a from informa in your regex?
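
To try that suggestion, here is a minimal sketch with the trailing a dropped, so the [\u0000-\uffff] class is what has to consume the accented letter. I left CANON_EQ out to keep the behavior easy to reason about, and the class and method names are only placeholders.

```java
import java.util.regex.Pattern;

class DropTrailingA {
    // Pattern from the question with the final "a" of "informa" removed.
    private static final Pattern P = Pattern.compile(
            "inform[\\u0000-\\uffff].*",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: true if the whole string matches the pattern.
    static boolean matches(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "\u00e1" is the precomposed form of á.
        System.out.println(matches("inform\u00e1tion") ? "Match!" : "No match");
    }
}
```

The catch is that the class accepts any character in that position, so this is only a quick way to confirm where the original pattern goes wrong, not a way to treat á as a.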

Answer 3

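It sounds like you want to match letters while ignoring diacritical marks. If that's right, normalize your strings to NFD form, strip out the diacritical marks, and then do your search.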

String normalized = java.text.Normalizer.normalize(textToSearch, java.text.Normalizer.Form.NFD);
String withoutDiacritical = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
// Search code goes here...
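
Filling in the search step, something along these lines should work for the strings in the question. This is only a sketch: the helper names are hypothetical, and it assumes a case-insensitive prefix check is what you are after.

```java
import java.text.Normalizer;
import java.util.Locale;

class DiacriticInsensitiveSearch {
    // Decompose to NFD, then drop the combining marks so "á" becomes a plain "a".
    static String stripDiacritics(String text) {
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
        return normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: prefix check that ignores both case and diacritics.
    static boolean startsWithIgnoringDiacritics(String text, String prefix) {
        String haystack = stripDiacritics(text).toLowerCase(Locale.ROOT);
        String needle = stripDiacritics(prefix).toLowerCase(Locale.ROOT);
        return haystack.startsWith(needle);
    }

    public static void main(String[] args) {
        boolean found = startsWithIgnoringDiacritics("inform\u00e1tion", "informa");
        System.out.println(found ? "Match!" : "No match");
    }
}
```

Both the text and the search term go through the same stripping step, so it does not matter which side carries the accent.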
