Is there a way to fix a wrongly encoded string?

Time: 2012-04-11 14:12:29

Tags: java character-encoding

I'm getting this string through a message broker (Stomp):

João

This is how it shows up:

João

Is there a way to recover it in Java?! Thanks!

2 Answers:

Answer 0: (score: 4)

U+00C3  Ã   c3 83   LATIN CAPITAL LETTER A WITH TILDE
U+00C2  Â   c3 82   LATIN CAPITAL LETTER A WITH CIRCUMFLEX
U+00A3  £   c2 a3   POUND SIGN
U+00E3  ã   c3 a3   LATIN SMALL LETTER A WITH TILDE
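
For anyone who wants to check these byte values, here is a minimal Java sketch that prints the same kind of rows. The class name `Utf8Table` and the helper `utf8Hex` are made up for this illustration and are not part of the original answer; the layout only imitates the table above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the original answer: prints code point,
// character, UTF-8 bytes and Unicode name for each character of a string.
class Utf8Table {

    // Returns the UTF-8 encoding of one code point as lowercase hex, e.g. "c3 83".
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The three non-ASCII characters of the broken string, then the intended one.
        // Escapes are used so the source file encoding cannot alter them.
        String chars = "\u00C3\u00C2\u00A3\u00E3";
        chars.codePoints().forEach(cp -> System.out.printf("U+%04X  %s  %s  %s%n",
                cp, new String(Character.toChars(cp)), utf8Hex(cp), Character.getName(cp)));
    }
}
```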

I can't see how this could be a data (encoding) conversion problem. Could the data simply be bad?

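If the data is not bad, then we have to assume you are misinterpreting its encoding. We don't know the original encoding, and unless you are doing something different, Java strings are UTF-16 internally. I don't see how João, encoded in any common encoding, could be interpreted as João in UTF-16.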

To make sure, I wrote this Python script, which found no match. I'm not entirely sure it covers every encoding, or that I haven't missed a corner case, FWIW.

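#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Try every (encode, decode) codec pair and report one that turns `good` into `bad`.
import encodings
import pkgutil
import sys

good = 'João'
bad = 'João'

# "aliases" is a helper module inside the encodings package, not a codec.
false_positives = {"aliases"}

found = {name for _, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg}
found.difference_update(false_positives)
print(found)

for x in found:
    for y in found:
        res = None
        try:
            res = good.encode(x).decode(y)
            print(res, x, y)
        except Exception:
            # Codec unavailable, not a text codec, or the bytes are invalid for it.
            pass
        if res == bad:
            print("FOUND")
            sys.exit(1)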
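
Since the question is about Java, roughly the same search can be sketched with `Charset.availableCharsets()`. This is an illustrative analogue of the script above, not something from the original answer: the class name `EncodingPairSearch` and the method `search` are invented, and the set of charsets depends on the JVM it runs on. Note that `String.getBytes(Charset)` replaces unmappable characters instead of failing, so this search is more lenient than the Python one.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical Java analogue of the Python search (all names are illustrative).
class EncodingPairSearch {

    // Returns "ENCODER -> DECODER" for every charset pair that turns good into bad.
    static List<String> search(String good, String bad) {
        Collection<Charset> all = Charset.availableCharsets().values();
        List<String> hits = new ArrayList<>();
        for (Charset enc : all) {
            if (!enc.canEncode()) {
                continue; // decode-only charsets cannot produce bytes
            }
            byte[] bytes;
            try {
                bytes = good.getBytes(enc);
            } catch (RuntimeException e) {
                continue; // skip charsets that misbehave on this input
            }
            for (Charset dec : all) {
                try {
                    if (new String(bytes, dec).equals(bad)) {
                        hits.add(enc.name() + " -> " + dec.name());
                    }
                } catch (RuntimeException e) {
                    // ignore this decoder and try the next one
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String good = "Jo\u00E3o";            // the intended string
        String bad = "Jo\u00C3\u00C2\u00A3o"; // the broken string from the question
        System.out.println(search(good, bad));
    }
}
```

As a sanity check, searching for the ordinary single-step mojibake (`"Jo\u00C3\u00A3o"`) should list the pair `UTF-8 -> ISO-8859-1` among its hits.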

Answer 1: (score: 2)

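In some cases a hack works. But it is best to prevent it from ever happening.

I ran into this problem before: I had a servlet that sent the correct headers, HTTP content type and page encoding, yet IE would submit forms encoded as latin1 instead of the correct encoding. So I wrote a quick and dirty hack (a request wrapper that detects the problem and converts, if the client really is IE) to fix it for new data, and that works fine. For the data in the database that was already messed up, I used the following hack.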

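Unfortunately, my hack doesn't work on your example string, but it looks very close (compared with my 'theoretical cause' reproduction of the broken string, there is just one extra Â in your broken string). So maybe my guess of "latin1" is wrong and you should try other encodings (for example those in the other link Tomas posted).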

package peter.test;

import java.io.UnsupportedEncodingException;

/**
* User: peter
* Date: 2012-04-12
* Time: 11:02 AM
*/
public class TestEncoding {
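    public static void main(String[] args) throws UnsupportedEncodingException {
        //In some cases a hack works. But best is to prevent it from ever happening.
        //Escapes are used so that the source file encoding cannot alter the literals.
        String good = "Jo\u00E3o";             // "João"
        String bad = "Jo\u00C3\u00C2\u00A3o";  // "João", the broken string from the question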

        //this line demonstrates what the "broken" string should look like if it is reversible.
        String broken = breakString(good, bad);

        //here we show that it is fixable if broken like breakString() does it.
        fixString(good, broken);

        //this line attempts to fix the string, but it is not fixable unless broken in the same way as breakString()
        fixString(good, bad);
    }

    private static String fixString(String good, String bad) throws UnsupportedEncodingException {
        // Read the Java chars as if each one were a single latin1 byte. If this works, the
        // byte count equals the char count; with UTF8 there would be more bytes.
        byte[] bytes = bad.getBytes("latin1");
        // Take the raw bytes and try to decode them as if they were UTF8.
        String fixed = new String(bytes, "UTF8");

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("fixed: " + fixed);
        System.out.println();

        return fixed;
    }

    private static String breakString(String good, String bad) throws UnsupportedEncodingException {
        byte[] bytes = good.getBytes("UTF8");
        String broken = new String(bytes, "latin1");

        System.out.println("Good: " + good);
        System.out.println("Bad: " + bad);
        System.out.println("bytes1.length: " + bytes.length);
        System.out.println("broken: " + broken);
        System.out.println();

        return broken;
    }
}

Result (using Sun JDK 1.7.0_03):

Good: João
Bad: João
bytes1.length: 5
broken: João

Good: João
Bad: João
bytes1.length: 5
fixed: João

Good: João
Bad: João
bytes1.length: 6
fixed: Jo�£o
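
The last block of output shows a catch with `new String(bytes, "UTF8")`: malformed input is silently replaced with U+FFFD, so a failed repair still returns a string. A stricter variant can ask the decoder to report errors instead. The sketch below assumes the damage is exactly one "UTF-8 bytes read as latin1" round trip. The class name `StrictMojibakeFix` and the method `tryFix` are invented for this example; `CharsetDecoder` and `CodingErrorAction.REPORT` are standard `java.nio.charset` API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical stricter version of fixString(): returns empty instead of a
// string containing U+FFFD when the input is not reversible this way.
class StrictMojibakeFix {

    static Optional<String> tryFix(String suspect) {
        try {
            // Fails if a char does not fit in one latin1 byte (anything above U+00FF).
            ByteBuffer raw = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(suspect));
            // Fails if those bytes are not well-formed UTF-8.
            String fixed = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(raw)
                    .toString();
            return Optional.of(fixed);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Broken the same way breakString() does it: repairable.
        System.out.println(tryFix("Jo\u00C3\u00A3o"));
        // The string from the question: not repairable by this round trip.
        System.out.println(tryFix("Jo\u00C3\u00C2\u00A3o"));
    }
}
```

A pure ASCII string passes through unchanged, so a non-empty result only means the round trip was valid, not that the input was actually broken.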