Question

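Suppose I have the following string: Rückruf ins Ausland. I need to insert it into a database column with a maximum size of 10. I did a normal substring in Java, which extracted the string "Rückruf in", which is 10 characters. When it tries to insert into this column, I get the following Oracle error: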

java.sql.SQLException: ORA-12899: value too large for column "WAEL"."TESTTBL"."DESC" (actual: 11, maximum: 10)

The reason is that the database uses the AL32UTF8 character set, so ü takes 2 bytes.

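I need to write a function in Java that performs this substring while taking into account that ü occupies 2 bytes, so in this case the returned substring should be "Rückruf i" (9 characters). Any suggestions?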

Answer 1

You can compute the correct length of a String in Java by converting it to a byte array.

As an example, see the following code:

System.out.println("Rückruf i".length()); // prints 9 
System.out.println("Rückruf i".getBytes().length); // prints 10

If the current (default) character set is not UTF-8, replace the code with:

System.out.println("Rückruf i".length()); // prints 9 
System.out.println("Rückruf i".getBytes("UTF-8").length); // prints 10

If needed, you can replace UTF-8 with any charset you like to test the length of the string in that charset.
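To make the difference between characters and bytes concrete, here is a minimal sketch using the string from the question. The class name `ByteLengthDemo` and the helper `utf8Length` are made up for illustration. `StandardCharsets.UTF_8` is used so that no checked exception has to be handled, and ü is written as `\u00fc` so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

final class ByteLengthDemo {
    // Hypothetical helper: number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String full = "R\u00fcckruf ins Ausland";   // "Rückruf ins Ausland"
        String tenChars = full.substring(0, 10);    // "Rückruf in"
        System.out.println(tenChars.length());      // 10 chars
        System.out.println(utf8Length(tenChars));   // 11 bytes, because ü needs 2 bytes
    }
}
```

This reproduces the numbers from the ORA-12899 message: the value has 10 characters but 11 bytes.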

Answer 2

If it has to be Java, you can convert the string to bytes and trim the array to the required length.

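        String s = "Rückruf ins Ausland";
        byte[] bytes = s.getBytes("UTF-8");
        // Copy at most 10 bytes; Math.min avoids an exception for short inputs.
        int len = Math.min(bytes.length, 10);
        byte[] bytes2 = new byte[len];
        System.arraycopy(bytes, 0, bytes2, 0, len);
        // Caveat: if the cut falls inside a multi-byte character, the decoder
        // replaces the broken tail with U+FFFD.
        String trim = new String(bytes2, "UTF-8");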
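Cutting a byte array at a fixed offset can land in the middle of a multi-byte character. One possible way around this, sketched below, relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx: if the first dropped byte is a continuation byte, the cut point is moved back to the start of that character. The names `Utf8BoundaryTrim` and `truncate` are hypothetical, and the sketch assumes the target encoding is UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BoundaryTrim {
    // Hypothetical helper: keep at most maxBytes UTF-8 bytes without splitting a character.
    static String truncate(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s;
        }
        int end = maxBytes;
        // bytes[end] is the first byte that would be dropped. While it is a
        // continuation byte (10xxxxxx), the cut is inside a character: back up.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(truncate("R\u00fcckruf ins Ausland", 10));
    }
}
```

The string is encoded only once, and the loop backs up by at most three bytes because a UTF-8 character is never longer than four bytes.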

Answer 3

I think the best option in this case is to do the substring at the database level, using the Oracle SUBSTR function directly in the SQL query.

For example:

INSERT INTO ttable (colname) VALUES (SUBSTR( ?, 1, 10 ))

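The question mark stands for the SQL parameter sent via JDBC. Note that SUBSTR counts characters; for a column whose limit is measured in bytes, Oracle's SUBSTRB function, which counts bytes, is the variant to use.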

Answer 4

If you want to trim the data in Java, you have to write a function that trims the string using the database charset, something like this test case:

package test;

import java.io.UnsupportedEncodingException;

public class TrimField {

    public static void main(String[] args) {
        //UTF-8 is the db charset
        System.out.println(trim("Rückruf ins Ausland",10,"UTF-8"));
        System.out.println(trim("Rüückruf ins Ausland",10,"UTF-8"));
    }

    public static String trim(String value, int numBytes, String charset) {
        do {
            byte[] valueInBytes = null;
            try {
                valueInBytes = value.getBytes(charset);
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e.getMessage(), e);
            }
            if (valueInBytes.length > numBytes) {
                value = value.substring(0, value.length() - 1);
            } else {
                return value;
            }
        } while (value.length() > 0);
        return "";

    }

}

Answer 5

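The following, admittedly clumsy, code walks through the entire string by full Unicode code points, so it also handles char pairs (surrogate pairs).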

import java.nio.charset.StandardCharsets;

public String trim(String s, int length) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if (bytes.length <= length) {
        return s;
    }
    int totalByteCount = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        int n = Character.charCount(cp);
        int byteCount = s.substring(i, i + n)
                .getBytes(StandardCharsets.UTF_8).length;
        if (totalByteCount + byteCount > length) {
            break;
        }
        totalByteCount += byteCount;
        i += n;
    }
    // Decode explicitly as UTF-8 rather than with the platform default charset.
    return new String(bytes, 0, totalByteCount, StandardCharsets.UTF_8);
}

It can still be optimized a bit.
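One possible optimization, sketched here as an assumption rather than as this answer's own method, is to derive the UTF-8 byte count of each code point from its numeric range instead of encoding a substring on every iteration. The names `Utf8CodePointTrim`, `utf8Bytes`, and `trim` are hypothetical.

```java
final class Utf8CodePointTrim {
    // UTF-8 length of a single code point, derived from the ranges of the encoding.
    static int utf8Bytes(int cp) {
        if (cp < 0x80) return 1;      // ASCII
        if (cp < 0x800) return 2;     // e.g. ü (U+00FC)
        if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
        return 4;                     // supplementary code points (surrogate pairs)
    }

    // Hypothetical helper: longest prefix of s that fits into maxBytes UTF-8 bytes.
    static String trim(String s, int maxBytes) {
        int total = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int n = utf8Bytes(cp);
            if (total + n > maxBytes) {
                break;
            }
            total += n;
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(trim("R\u00fcckruf ins Ausland", 10));
    }
}
```

This avoids allocating any byte arrays. An unpaired surrogate is counted as 3 bytes here, which is conservative: the result may be slightly shorter than necessary for malformed input, but never longer than the limit.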

Answer 6

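You need to make the encoding in the database match the encoding of the Java string. Alternatively, you can convert the string with something like the following and get the length that matches the database encoding. This gives you the exact byte count. Otherwise, you are still just hoping that the encodings match.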

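    String string = "Rückruf ins Ausland";
    int maxBytes = 10; // column size in bytes

    int curByteCount = 0;
    // Add one char at a time while the next char still fits in the byte budget.
    // Note: this walks char by char, so it assumes no surrogate pairs.
    for (int index = 0; index < string.length(); index++) {
        int nextLen = string.substring(index, index + 1).getBytes("UTF-8").length;
        if (curByteCount + nextLen > maxBytes) {
            break;
        }
        curByteCount += nextLen;
    }
    // Size the array to the bytes actually kept so no trailing NUL bytes are decoded.
    byte[] subStringBytes = new byte[curByteCount];
    System.arraycopy(string.getBytes("UTF-8"), 0, subStringBytes, 0, curByteCount);
    String trimed = new String(subStringBytes, "UTF-8");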

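That should do it. It also should not cut a multi-byte character in half in the process. The assumptions here are that the database is UTF-8 encoded and that the string actually needs to be trimmed.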
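If the database encoding is not guaranteed to be UTF-8, a charset-independent alternative is to let a `CharsetEncoder` encode into a `ByteBuffer` of the target size and then check how many chars it consumed. This is a sketch of a different technique, not the code from this answer. The names `EncoderTrim` and `trim` are hypothetical, and the charset passed in is assumed to match the database.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderTrim {
    // Hypothetical helper: longest prefix of s whose encoding fits into maxBytes.
    static String trim(String s, int maxBytes, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        // The encoder stops when the output buffer is full and never writes a
        // partial character; in.position() is the number of chars it consumed.
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(trim("R\u00fcckruf ins Ausland", 10, StandardCharsets.UTF_8));
    }
}
```

The same helper works for other encodings, for example `StandardCharsets.ISO_8859_1`, where ü takes a single byte and the first 10 characters fit.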

Answer 7

Hey, all ASCII characters are below 128. You can use the following code.

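public class Test {
    public static void main(String[] args) {
        String s = "Rückruf ins Ausland";
        int length = 10;
        // Each non-ASCII char in the kept prefix costs one char of the budget.
        // Assumes every non-ASCII char takes exactly 2 bytes (true only up to U+07FF).
        for (int i = 0; i < Math.min(s.length(), length); i++) {
            if (s.charAt(i) >= 128) {
                length--;
            }
        }
        System.out.println(s.substring(0, Math.min(s.length(), length)));
    }
}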

You can copy and paste it and check whether it meets your needs or whether it breaks anywhere.

How do I take a substring of a UTF-8 string in Java?
