Download XML, remove the BOM and encode as UTF-8

Time: 2014-01-27 14:22:31

Tags: java xml utf-8

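I am downloading XML files from an FTP server and have to prepare them for my SAX parser. To do that, I need to remove the BOM bytes and encode the files as UTF-8. But somehow it doesn't work for every file.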

Here is the code for my functions:

public static void copy(File src, File dest){

    try {
        byte[] data = Files.readAllBytes(src.toPath());

        writeAsUTF8(dest, skipBom(data));

    } catch (IOException e) {
        e.printStackTrace();
    }
}


private static void writeAsUTF8(File out, byte[] data){

    try {

        FileOutputStream outStream = new FileOutputStream(out);
        OutputStreamWriter outUTF = new OutputStreamWriter(outStream,"UTF8");

        outUTF.write(new String(data, "UTF8"));
        //outUTF.write(new String(data));
        outUTF.flush();
        outStream.close();
        outUTF.close();
    }
    catch(Exception ex){
        ex.printStackTrace();
    }
}

private static byte[] skipBom(byte[] data){

    int skipBytes = getBomSize(data);

    byte[] tmp = new byte[data.length - skipBytes];

    for(int x = 0; x < tmp.length; x++){
        tmp[x] = data[x + skipBytes];
    }

    return tmp;
}

Any idea what I'm doing wrong?

3 answers:

Answer 0 (score: 1)

Simplify.

    writeAsUTF8(dest, data);



try {
    // The BOM is U+FEFF, which is the three bytes EF BB BF in UTF-8.
    int bomLength = "\uFEFF".getBytes(StandardCharsets.UTF_8).length;
    if (data.length < bomLength
            || !new String(data, 0, bomLength, StandardCharsets.UTF_8).equals("\uFEFF")) {
        bomLength = 0; // no BOM present, keep every byte
    }
    FileOutputStream outStream = new FileOutputStream(out);
    outStream.write(data, bomLength, data.length - bomLength);
    outStream.close();
}
catch(Exception ex){
    ex.printStackTrace();
}

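This checks whether a BOM (U+FEFF) is present. It would be simpler to read everything into a String:

String xml = new String(data, StandardCharsets.UTF_8);
xml = xml.replaceFirst("^\uFEFF", "");

Using a Charset instead of a String parameter for the encoding means one less exception to catch: UnsupportedEncodingException (an IOException).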
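The String-based variant can be sketched as a small helper, assuming the input really is UTF-8. The class name `BomStripSketch` and the method `stripUtf8Bom` are hypothetical names chosen for illustration, not part of the code above. Because the `Charset` overload of the `String` constructor is used, no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the names are not from the original post.
final class BomStripSketch {

    // Decodes the bytes as UTF-8 and drops a single leading U+FEFF, if present.
    static String stripUtf8Bom(byte[] data) {
        // The Charset overload declares no checked exception.
        String xml = new String(data, StandardCharsets.UTF_8);
        if (!xml.isEmpty() && xml.charAt(0) == '\uFEFF') {
            return xml.substring(1);
        }
        return xml;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of the BOM, followed by "<a/>".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(stripUtf8Bom(withBom));
    }
}
```

Input without a BOM passes through unchanged, so the helper can be applied to every downloaded file.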


Detecting the XML encoding:

String xml = new String(data, StandardCharsets.ISO_8859_1);
String encoding = xml.replaceFirst(
        "(?s)^.*<\\?xml.*encoding=([\"'])([\\w-]+)\\1.*\\?>.*$",
        "$2");

if (encoding.equals(xml)) {
    encoding = "UTF-8";
}
xml = new String(data, encoding);
xml = xml.replaceFirst("^\uFEFF", "");
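For reference, the same idea can be packaged as a reusable method. This is only a sketch under some assumptions: the names `XmlEncodingSketch`, `detect` and `decode` are hypothetical, only the first 200 bytes are inspected, and the approach only works for ASCII-compatible encodings (a UTF-16 file would need BOM sniffing instead). Unknown or missing encodings fall back to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the names are not from the original post.
final class XmlEncodingSketch {

    // Matches encoding="..." (or '...') inside an XML declaration.
    private static final Pattern DECLARATION = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*([\"'])([A-Za-z][\\w.-]*)\\1");

    // Returns the declared charset, or UTF-8 if none is declared or supported.
    static Charset detect(byte[] data) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        int headLength = Math.min(data.length, 200);
        String head = new String(data, 0, headLength, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARATION.matcher(head);
        if (m.find() && Charset.isSupported(m.group(2))) {
            return Charset.forName(m.group(2));
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes with the detected charset and removes a leading U+FEFF.
    static String decode(byte[] data) {
        String xml = new String(data, detect(data));
        return xml.startsWith("\uFEFF") ? xml.substring(1) : xml;
    }
}
```

Using `Pattern.find` on a short prefix avoids running a greedy regex over the whole document, which matters for large downloads.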

Answer 1 (score: 0)

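Why do you want to remove the BOM bytes? You only need to read the file into a String using the file's encoding, and then write that String to a file using UTF-8 encoding.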
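A minimal sketch of this read-then-write approach, assuming the source encoding is known in advance. The names `TranscodeSketch` and `transcode` are hypothetical. One caveat: Java's UTF-8 decoder does not drop a leading BOM but returns it as the character U+FEFF, so the sketch removes that character explicitly, on the assumption that it is never wanted at the start of an XML file.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; the names are not from the original post.
final class TranscodeSketch {

    // Reads src using sourceCharset and writes the same text to dest as UTF-8.
    static void transcode(Path src, Charset sourceCharset, Path dest) throws IOException {
        String text = new String(Files.readAllBytes(src), sourceCharset);
        // Assumption: a leading U+FEFF is an unwanted BOM rather than content.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        Files.write(dest, text.getBytes(StandardCharsets.UTF_8));
    }
}
```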

Answer 2 (score: 0)

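I can't figure out what is wrong with your code. I had the same problem some time ago and used the code below. First, the following function reads the file while skipping the first character, which is the BOM. Of course, this only makes sense if you are sure that all of your files have a BOM.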

public byte[] load (File inputFile, int lines) throws Exception {

    // Declared outside the try block so it is still in scope for the return
    StringBuilder builder = new StringBuilder();

    try (BufferedReader reader
        = new BufferedReader(
            new InputStreamReader(
                new FileInputStream(inputFile), "UTF-8")))
    {
        // Discard the Byte Order Mark (decoded as the single character U+FEFF)
        reader.read();

        String line = null;
        int lineCount = 0;

        while( lineCount <= lines && (line = reader.readLine()) != null ) {
            lineCount += 1;
            builder.append(line).append("\n");
        }
    }

    // Encode explicitly as UTF-8 instead of relying on the platform default
    return builder.toString().getBytes(StandardCharsets.UTF_8);
}

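You can rewrite the function above so that it writes the data back to another file in UTF-8. I occasionally use the following method to convert a file on disk from ISO-8859-1 to UTF-8: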

public static void convertToUTF8 (Path p) throws Exception {
    Path docPath = p;
    Path docPathUTF8 = docPath;

    InputStreamReader in = new InputStreamReader(new FileInputStream(docPath.toFile()), StandardCharsets.ISO_8859_1);

    CharBuffer cb = CharBuffer.allocate(100 * 1000 * 1000);
    int c = -1;

    while ( (c = in.read()) != -1 ) {
        cb.put((char) c);
    }
    in.close();

    OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(docPathUTF8.toFile()), StandardCharsets.UTF_8);

    char[] x = new char[cb.position()];
    System.arraycopy(cb.array(), 0, x, 0, x.length);

    out.write(x);
    out.flush();
    out.close();
}
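The method above allocates a fixed buffer of 100 million chars and throws a BufferOverflowException on larger files. As an alternative sketch (not the original author's code), the conversion can be streamed through a small buffer. The names `StreamingConvertSketch` and `convertToUtf8` are hypothetical, and writing to a temporary file in the same directory before replacing the original is an assumption about what is acceptable on the target file system.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; the names are not from the original post.
final class StreamingConvertSketch {

    // Converts the file at p from ISO-8859-1 to UTF-8, replacing it in place.
    static void convertToUtf8(Path p) throws IOException {
        // Write to a temporary file first so the source is never read and
        // overwritten at the same time.
        Path tmp = Files.createTempFile(p.toAbsolutePath().getParent(), "convert", ".tmp");
        try (Reader in = Files.newBufferedReader(p, StandardCharsets.ISO_8859_1);
             Writer out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        Files.move(tmp, p, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The try-with-resources block closes both streams even if an exception is thrown, and memory use stays constant regardless of file size.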