Compressing UTF-8 into UTF-16: decompression doesn't work

Asked: 2014-08-12 09:06:10

Tags: javascript encoding utf-8 compression utf-16

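JavaScript stores strings in localStorage as UTF-16, but most of my data falls within the UTF-8 range, so I'm trying to pack it into UTF-16 before storing it.

I found a compression algorithm online, in Text re-encoding for optimising storage capacity in the browser, and I used the third one listed on that page. Unfortunately, the page doesn't provide a function for decompressing the data.

It encodes my data as base64. Since each base64 character can be described in 6 bits (where 'A' = 0 and '/' = 63) and a UTF-16 character has 16 bits, I can pack several base64 characters into each UTF-16 character.

For example, the base64 snippet R2lIdaXh can be split into 6-bit values as follows: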

R      2      l      I      d      a      X      h
17     54     37     8      29     26     23     33
010001 110110 100101 001000 011101 011010 010111 100001

These are then concatenated and encoded as UTF-16 as follows:

䝩               䡵                ꗡ
18281            18549            42465
0100011101101001 0100100001110101 1010010111100001

Because 16 % 6 = 4, some base64 characters end up split across two UTF-16 characters.
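The packing described above can be sketched with a plain bit accumulator. This is an illustration only, not the algorithm from the linked page: `ALPHABET` and `packSixBitValues` are hypothetical names, the standard base64 alphabet is assumed, and any leftover bits at the end are ignored.

```javascript
// Standard base64 alphabet: index 0 is 'A', index 63 is '/'.
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical helper: pack 6-bit base64 values into 16-bit code units.
function packSixBitValues(b64) {
  const units = [];
  let acc = 0;      // bit accumulator
  let accBits = 0;  // number of valid bits currently held in acc
  for (const ch of b64) {
    acc = (acc << 6) | ALPHABET.indexOf(ch);
    accBits += 6;
    if (accBits >= 16) {
      accBits -= 16;
      units.push((acc >> accBits) & 0xffff);
      acc &= (1 << accBits) - 1; // keep only the bits not yet emitted
    }
  }
  return units; // leftover bits (if any) are dropped in this sketch
}

console.log(packSixBitValues('R2lIdaXh')); // [ 18281, 18549, 42465 ]
```

The three numbers match the code units in the table above; the third and sixth base64 characters are the ones split across two units.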

The algorithm used for compression is:

/*
 * Converts a string to base64 then re-encodes it as UTF-16
 *
 * @param {string} string The string to be re-encoded
 * @return {string}
 */
function compress(string) {
  var output = '';
  var encoded = base64.encode(string);
  var bits = 16;
  var charCode = 0;
  var rem = 0;
  var len = encoded.length;
  for (var i = 0; i < len; i++) {
    var char = encoded[i];
    if (bits > 6) {
      /* Enough bits left to store this byte */
      bits -= 6;
      /* Shift the bits left into their position and sum them to the total */
      charCode += base64.indices[char] << bits;
    } else {
      /* This byte will overflow */
      rem = 6 - bits;
      charCode += base64.indices[char] >> rem;
      output += String.fromCharCode(charCode);
      charCode = (base64.indices[char] % rem) << (16 - rem);
      bits = 16 - rem;
    }
  }
  return output;
}

I then use the following function to decompress:

/*
 * Reverses the compress function.
 * Converts a UTF-16 string to 6-bit character codes, maps them to base64 then decodes the base64.
 *
 * @param {string} string The string to be re-encoded
 * @return {string}
 */
function decompress(string) {
  var output = '';
  var byte = 0;
  var rem = 0;
  var bits = 0;
  var charCode = 0;
  var len = string.length;
  for (var i = 0; i < len; i++) {
    bits += 16;
    byte += string.charCodeAt(i);
    while (bits >= 6) {
      bits -= 6;
      /* Retrieve the left-most 6 relevant bits from the byte */
      charCode = byte >> bits;
      /* Map the number to base64 */
      output += base64.chars[charCode];
      /* Remove the retrieved bits from the byte */
      byte -= charCode << bits;
    }
    if (bits !== 0) {
      /* Push the remaining bits to the left of the next byte */
      byte = byte << 16;
    }
  }
  return base64.decode(output);
}

These functions get close, but they produce some errors. For example:

var test = 'Gee whiz! How do you do? Would you like some tea?';
var compressed = compress(test);
var decompressed = decompress(compressed);
console.log('Input:', test);
console.log('Compressed:', compressed);
console.log('Decompressed:', decompressed);

The result is:

Input:       Gee whiz! How do you do? Would you like some tea?
Compressed:  䝥攠㝨楺℠࡯眠摯⁹潵⁤⼿⁗潵Ɽ⁹潵楫攠㍯浥⁴╡㼐
Decompressed: Gee 7hiz! ow do you d/? Wou,d you like 3ome t%a?

If we compare the base64 input and output, we get the following:

Input:  R2VlIHdoaXohIEhvdyBkbyB5b3UgZG8/IFdvdWxkIHlvdSBsaWtlIHNvbWUgdGVhPw==
Output: R2VlIDdoaXohIAhvdyBkbyB5b3UgZC8/IFdvdSxkIHlvdSBsaWtlIDNvbWUgdCVhPx
Errors:      6       14              30      38              54      62  66

I'm starting to think the error might be in the compress function, but I'm not sure. Can anyone see why the decompressed data doesn't match?
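One possible cause, offered as an unconfirmed sketch rather than a verified diagnosis: the first six error positions (6, 14, 30, 38, 54, 62) are each the sixth character of an 8-character group, which is where a base64 character is split with 2 bits in one unit and 4 bits in the next. In that branch, `base64.indices[char] % rem` takes the value modulo `rem`, which is not the same as keeping the low `rem` bits. The example below assumes the standard base64 alphabet, where 'H' = 7 and 'D' = 3, and reproduces the first error ('H' decoded as 'D').

```javascript
// Value of base64 'H' in the standard alphabet.
const H = 7; // 000111

// The character straddles two units: its top 2 bits go into the current
// unit, so the rem = 4 low bits must be carried into the next unit.
const rem = 4;

const carriedByModulo = H % rem;            // what the question's code keeps
const carriedByMask = H & ((1 << rem) - 1); // the low `rem` bits

// Rebuild the 6-bit value from the top bits plus the carried bits.
const top = H >> rem;
const rebuiltByModulo = (top << rem) | carriedByModulo;
const rebuiltByMask = (top << rem) | carriedByMask;

console.log(rebuiltByModulo); // 3, which is base64 'D'
console.log(rebuiltByMask);   // 7, which is base64 'H'
```

If this is the cause, masking with `(1 << rem) - 1` instead of taking `% rem` would be the corresponding change in compress. The sketch does not cover the '=' padding characters or the final partial unit, which compress never writes out.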

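1 Answer:

Answer 0 (score: 0)

LZString compresses strings specifically for localStorage, and I strongly recommend using it in this case. My code below was written purely as a learning exercise. It works, but please don't use it.

In any case, reading my answer may teach you a bit more about compression. My code is intended only for client-side use and for storage in localStorage.

Your code uses btoa, but I found it very hard to work with: it throws an error on any character whose code is above 255, which rules out many everyday characters. I chose to use encodeURIComponent instead.

The only problem occurs with characters whose codes are between 55296 and 57343 (the UTF-16 surrogate range), which also cause an error. I worked around that range in my code and ran some tests. It handles accented and other special characters well, and the output should be identical to the input.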
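The surrogate-range error mentioned above can be reproduced in isolation. This is a minimal sketch using only standard built-ins; the variable names are illustrative.

```javascript
// A lone high surrogate (code 55296, i.e. 0xD800) is not valid UTF-16 on its own.
const lone = String.fromCharCode(55296);

let threw = false;
try {
  encodeURIComponent(lone);
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true

// A complete surrogate pair encodes without error.
const pair = '\uD83D\uDE00'; // U+1F600
console.log(encodeURIComponent(pair)); // %F0%9F%98%80
```

Only unpaired surrogates trigger the error; text containing complete pairs, such as most emoji, goes through encodeURIComponent normally.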

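Because it packs two plain characters such as A-Z into a single UTF-16 character, it compresses most common text. However, encodeURIComponent converts some kinds of characters into multi-character escape sequences, so sometimes there is no compression and the output has more characters than the input.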
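The trade-off can be estimated without running the full compressor. The sketch below assumes two percent-encoded characters are stored per UTF-16 character, as in the code that follows; `packedLength` is a hypothetical helper, not part of the answer's code.

```javascript
// Hypothetical helper: number of UTF-16 code units the packed form needs,
// assuming two percent-encoded characters are stored per unit.
function packedLength(text) {
  return Math.ceil(encodeURIComponent(text).length / 2);
}

const plain = 'abcdefgh';
const spaced = 'a b';                               // the space becomes %20
const accented = '\u00e0\u00e9\u00ee\u00f5\u00fc'; // each becomes %XX%XX

console.log(plain.length, packedLength(plain));       // 8 4
console.log(spaced.length, packedLength(spaced));     // 3 3
console.log(accented.length, packedLength(accented)); // 5 15
```

Unreserved ASCII halves in size, while each escaped character costs three or more encoded characters, which is why heavily accented text can grow.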

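In general, though, the compression rate is around 30% in almost all cases. So I achieved my goal, since it does compress, but I will use LZString because it is far superior and can store more information. My code seems to be faster, but only by milliseconds, which is not worth it!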

// compress-utf8.js v1
function compressUtf8InnerOfUtf16(text) {
    let output = '';
    try{
        text = encodeURIComponent(text);
    }catch(e){
        /*
        this try block is rarely needed, since the error is uncommon:
        encodeURIComponent throws a URIError on lone surrogates
        (code units between 55296 and 57343 that are not part of a valid pair)
        try it yourself: encodeURIComponent(String.fromCharCode(55296))
        https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent#Description
        */
        let newText = '';
        let textLength = text.length;
        for(let i = 0; i < textLength; i++){
            let code = text.charCodeAt(i);
            if( code >= 55296 && code <= 57343 ){
                newText += String.fromCharCode(0, 1, 3) + code;
                continue;
            };
            newText += text[i];
        };
        text = encodeURIComponent(newText);
    };
    text += text.length % 2 ? String.fromCharCode(0) : ''; // if is odd
    let textLength = text.length;
    for (let i = 0; i < textLength; i += 2) {
        let code = text.charCodeAt(i) * 256 + text.charCodeAt(i + 1);
        output += String.fromCharCode(code);
    };
    return output;
};

function decompressUtf8InnerOfUtf16(text) {
    let output = '',
        textLength = text.length,
        lastCharacter = '';
    for (let i = 0; i < textLength; i++) {
        let code = text.charCodeAt(i),
            firstCharacter = Math.floor(code / 256);
        lastCharacter = code % 256;
        output += String.fromCharCode(firstCharacter, lastCharacter);
    };
    output = lastCharacter !== 0 ? output : output.substring(0, output.length - 1);
    output = decodeURIComponent(output);
    // the regexp and if block are only necessary if try block exist in compress function
    let regexp = new RegExp(String.fromCharCode(0, 1, 3) + '\\d{5}', 'g');
    if(output.match(regexp)){
        output = output.replace(regexp, function(match){
            // skip the 3 marker characters; the remaining 5 digits are the code
            return String.fromCharCode(Number(match.substring(3)));
        });
    };
    return output;
};

In this example we can see that accented characters work. Because there are many special characters, the character count increases in this case:

let input = 'ÀÁÂÃÄÅàáâãäåÇĆĈḈçćĉḉÐðÈÉÊËèéêëǴĜǵĝĤĥÌÍÎÏìíîïĴĵḰḱĹĺḾḿÑǸŃñǹńÒÓÔÕÖòóôõöṔṕŔŕŠŜŚšŝś    ÙÚÛÜùúûüṼṽẂẀŴẃẁŵÝŸỲỸŶýÿỳỹŷŽẐŹžẑź&';
let compressed = compressUtf8InnerOfUtf16(input);
let decompressed = decompressUtf8InnerOfUtf16(compressed);
console.log(compressed);
console.log(decompressed);
console.log(input === decompressed);
console.log( input.length, compressed.length );

In this example, with plain text, the compression works as well as it does in most cases:

let input = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque dignissim efficitur gravida. Sed vitae augue dolor. Nulla semper, dolor eu volutpat ultricies, diam odio convallis velit, lobortis accumsan urna magna quis magna. In hac habitasse platea dictumst. Maecenas ultricies turpis eget nisi ultricies feugiat. Pellentesque faucibus, turpis in ornare cursus, orci leo aliquet ligula, sit amet faucibus dui diam egestas arcu. Integer rhoncus tortor nec mauris feugiat efficitur. Nam viverra justo quis elementum semper. Nullam nec purus gravida, pretium risus eget, vulputate dolor. Nullam porta aliquet odio, nec commodo enim. Curabitur ultrices luctus condimentum. Aliquam tristique congue ipsum, ac semper odio venenatis id.';
let compressed = compressUtf8InnerOfUtf16(input);
let decompressed = decompressUtf8InnerOfUtf16(compressed);
console.log(compressed);
console.log(decompressed);
console.log(input === decompressed);
console.log( input.length, compressed.length );

I based part of this on the code explained in the article.
