Question

When I use a regular expression, Java's regex engine apparently counts umlauts and other special characters as non-"word characters".

        "TESTÜTEST".replaceAll( "\\W", "" )

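returns "TESTTEST" for me. What I want is to remove only the characters that are truly non-"word characters". Is there any way to do this without resorting to something like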

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize that I forgot ô?

Answer 1

Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols such as ¼; the latter doesn't. See it on regex101.com.
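To make the difference concrete, here is a minimal sketch that runs the question's `\W` pattern and the `[^\p{L}\p{Nd}]+` pattern side by side. The class and method names (`UnicodeWordFilter`, `stripAsciiNonWord`, `stripNonLetterDigit`) are made up for illustration, and non-ASCII characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class UnicodeWordFilter {
    // Default \W: everything outside [a-zA-Z_0-9], so accented letters are removed too.
    private static final Pattern ASCII_NON_WORD = Pattern.compile("\\W");

    // Anything that is neither a Unicode letter (\p{L}) nor a decimal digit (\p{Nd}).
    private static final Pattern NON_LETTER_DIGIT = Pattern.compile("[^\\p{L}\\p{Nd}]+");

    static String stripAsciiNonWord(String input) {
        return ASCII_NON_WORD.matcher(input).replaceAll("");
    }

    static String stripNonLetterDigit(String input) {
        return NON_LETTER_DIGIT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "TEST\u00DCTEST"; // U+00DC is U with diaeresis
        System.out.println(stripAsciiNonWord(sample));   // the umlaut is dropped
        System.out.println(stripNonLetterDigit(sample)); // the umlaut is kept

        // U+00BC (vulgar fraction one quarter) is in category No, not Nd, so it is removed.
        System.out.println(stripNonLetterDigit("TEST-\u00DC_TEST \u00BC 42!"));
    }
}
```

Note that, unlike `\w`, this character class does not keep the underscore; add `_` inside the brackets if you need it.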

Answer 2

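When I came across this post I was trying to achieve the exact opposite. I know it is quite old, but here is my solution anyway. You can use blocks, see here. In this case, compile the following code (with the right imports):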

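import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "äêìóblah";
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a Unicode block
Matcher m = p.matcher(s);
System.out.println(m.find()); // is there at least one run of Latin-1 Supplement characters?
System.out.println(s.replaceAll(p.pattern(), "#")); // each run is replaced by a single '#'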

You should see the following output:

true

#blah

Best,

Answer 3

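Sometimes you don't want to simply remove the characters, but only strip their accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL: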

import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {
    private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
    private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
            "sz" };

    /**
     * Normalizes a String by removing all accents to original 127 US-ASCII
     * characters. This method handles German umlauts and "sharp-s" correctly
     * 
     * @param s
     *            The String to normalize
     * @return The normalized String
     */
    public static String normalize(String s) {
        if (s == null)
            return null;

        String n = null;

        n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
        n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

        return n;
    }

    /**
     * Returns a clean representation of a String which might be used safely
     * within an URL. Slugs are a more human friendly form of URL encoding a
     * String.
     * <p>
     * The method first normalizes a String, then converts it to lowercase and
     * removes ASCII characters, which might be problematic in URLs:
     * <ul>
     * <li>all whitespaces
     * <li>dots ('.')
     * <li>(semi-)colons (';' and ':')
     * <li>equals ('=')
     * <li>ampersands ('&')
     * <li>slashes ('/')
     * <li>angle brackets ('<' and '>')
     * </ul>
     * 
     * @param s
     *            The String to slugify
     * @return The slugified String
     * @see #normalize(String)
     */
    public static String slugify(String s) {

        if (s == null)
            return null;

        String n = normalize(s);
        n = StringUtils.lowerCase(n);
        n = n.replaceAll("[\\s.:;&=<>/]", "");

        return n;
    }
}

Being a German speaker, I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.

HTH

EDIT: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
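If pulling in commons-lang only for `replaceEachRepeatedly` is not an option, the `normalize` step can be approximated with the JDK alone. The following is a rough sketch, not the author's class: `AsciiFolder` and its replacement table are illustrative names, and it assumes the input uses precomposed umlauts (U+00C4 and friends) rather than a base letter followed by a combining diaeresis.

```java
import java.text.Normalizer;

// Hypothetical JDK-only variant of the normalize() method shown above.
class AsciiFolder {
    // German special cases from the answer, written as Unicode escapes:
    // A/O/U with diaeresis (upper and lower case) and sharp s.
    private static final String[][] GERMAN = {
        {"\u00C4", "Ae"}, {"\u00E4", "ae"},
        {"\u00D6", "Oe"}, {"\u00F6", "oe"},
        {"\u00DC", "Ue"}, {"\u00FC", "ue"},
        {"\u00DF", "sz"}
    };

    static String normalize(String s) {
        if (s == null) {
            return null;
        }
        String n = s;
        // Apply the language-specific replacements first.
        for (String[] pair : GERMAN) {
            n = n.replace(pair[0], pair[1]);
        }
        // Decompose the remaining accented letters, then drop everything non-ASCII,
        // which removes the combining marks and any other non-ASCII character.
        n = Normalizer.normalize(n, Normalizer.Form.NFD);
        return n.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u00DCber Stra\u00DFe, caf\u00E9"));
    }
}
```

As in the original class, characters without an ASCII base letter are silently removed, so this is only suitable where that loss is acceptable.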

Answer 4

Well, here is one solution I ended up with, but I hope there's a more elegant one...

StringBuilder result = new StringBuilder();
for(int i=0; i<name.length(); i++) {
    char tmpChar = name.charAt( i );
    if (Character.isLetterOrDigit( tmpChar) || tmpChar == '_' ) {
        result.append( tmpChar );
    }
}

result ends up holding the desired result...
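One caveat: the loop above inspects one `char` at a time, so a letter outside the Basic Multilingual Plane, which Java stores as a surrogate pair, fails `Character.isLetterOrDigit(char)` and is dropped. A possible variant, sketched here with the made-up names `CodePointFilter` and `keepWordChars`, walks code points instead; it assumes Java 8 or later for `String.codePoints()`.

```java
// Hypothetical code-point-aware version of the loop above.
class CodePointFilter {
    static String keepWordChars(String name) {
        return name.codePoints()
                // Keep Unicode letters, digits and the underscore, as in the original loop.
                .filter(cp -> Character.isLetterOrDigit(cp) || cp == '_')
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // U+00DC (U with diaeresis) is kept, '-' and '!' are removed.
        System.out.println(keepWordChars("TEST-\u00DC_TEST!"));
    }
}
```

For input that only contains BMP characters the two versions should behave the same.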

Answer 5

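You may want to remove the accents and diacritic signs first, then check at each character position whether the "simplified" string has an ASCII letter there - if it does, the original position contains a word character; if not, it can be removed.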
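One way to read this suggestion is sketched below. Because NFD decomposition changes the string length, the sketch simplifies each code point on its own instead of comparing two full strings position by position. The names `AccentAwareFilter`, `isWordCodePoint` and `keepWordChars` are invented for the example, and treating ASCII digits as word characters is an extra assumption on top of the answer, which only mentions letters.

```java
import java.text.Normalizer;

// Hypothetical sketch: keep a character if its accent-free form is an ASCII letter or digit.
class AccentAwareFilter {
    static boolean isWordCodePoint(int cp) {
        // Decompose this single code point, e.g. U+00DC becomes 'U' followed by U+0308.
        String decomposed = Normalizer.normalize(
                new String(Character.toChars(cp)), Normalizer.Form.NFD);
        // Drop the combining marks to obtain the "simplified" form.
        String simplified = decomposed.replaceAll("\\p{M}+", "");
        if (simplified.length() != 1) {
            return false; // e.g. a lone combining mark simplifies to the empty string
        }
        char c = simplified.charAt(0);
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    static String keepWordChars(String input) {
        StringBuilder out = new StringBuilder();
        // Append the ORIGINAL code point whenever its simplified form passes the check.
        input.codePoints()
             .filter(AccentAwareFilter::isWordCodePoint)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepWordChars("TEST\u00DCTEST!")); // keeps the umlaut, drops '!'
    }
}
```

Letters that have no canonical decomposition to an ASCII base, such as ß (U+00DF), are removed by this check, so the regex-based approach from Answer 1 is probably the safer default.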

Answer 6

You can use StringUtils from Apache.

Remove all non-"word characters" from a String in Java, leaving accented characters?
