Class UnicodeCompressor


  • public final class UnicodeCompressor
    extends Object
    A compression engine implementing the Standard Compression Scheme for Unicode (SCSU) as outlined in Unicode Technical Report #6.

    SCSU works by using dynamically positioned windows, each covering 128 consecutive Unicode characters. During compression, a character that falls within the active window is encoded in the compressed stream as a single byte in the range 0x80 - 0xFF. SCSU is largely transparent for the characters between U+0000 and U+00FF: ASCII text passes through unchanged, and because the initial window is positioned at U+0080, Latin-1 text is also stored as one byte per character. As a result, SCSU approximates the storage size of traditional character sets: for example, 1 byte per character for ASCII or Latin-1 text, and 2 bytes per character for CJK ideographs.
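
    The window arithmetic can be illustrated with a minimal sketch. This is not the UnicodeCompressor implementation; the class ScsuWindowSketch and its methods are hypothetical names used only for illustration, and the sketch assumes that a 128-character window maps onto the 128 byte values 0x80 - 0xFF. It ignores window selection, tags, and Unicode mode entirely.

```java
// Illustrative sketch only: hypothetical helper, not part of ICU4J.
final class ScsuWindowSketch {
    /** Number of consecutive characters covered by one dynamic window. */
    static final int WINDOW_SIZE = 128;

    /** First byte value used for in-window characters (assumed 0x80). */
    static final int WINDOW_BYTE_BASE = 0x80;

    /** Returns true if c lies inside the window that starts at windowOffset. */
    static boolean inWindow(char c, int windowOffset) {
        return c >= windowOffset && c < windowOffset + WINDOW_SIZE;
    }

    /** Maps an in-window character to its single byte value (0x80 - 0xFF). */
    static int encodeInWindow(char c, int windowOffset) {
        if (!inWindow(c, windowOffset)) {
            throw new IllegalArgumentException("character is outside the window");
        }
        return WINDOW_BYTE_BASE + (c - windowOffset);
    }

    /** Maps a byte value in 0x80 - 0xFF back to the character in the window. */
    static char decodeInWindow(int b, int windowOffset) {
        if (b < WINDOW_BYTE_BASE || b > 0xFF) {
            throw new IllegalArgumentException("byte is not an in-window value");
        }
        return (char) (windowOffset + (b - WINDOW_BYTE_BASE));
    }
}
```

    With a window positioned at U+0400, the Cyrillic letter U+0416 maps to the byte 0x96. With a window positioned at U+0080, U+00E9 maps to 0xE9, the same value it has in Latin-1.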

    USAGE

    The static methods on UnicodeCompressor compress a whole string in a single call:

      String s = ... ; // get string from somewhere
      byte [] compressed = UnicodeCompressor.compress(s);
     

    The static methods have a fairly large memory footprint. For finer-grained control over memory usage, UnicodeCompressor also offers instance methods that compress the input iteratively through a caller-supplied buffer:

      // Compress an array "chars" of length "len" using a buffer of 512 bytes
      // to the OutputStream "out"
    
      UnicodeCompressor myCompressor         = new UnicodeCompressor();
      final int         BUFSIZE              = 512; // local variables cannot be static
      byte []           byteBuffer           = new byte [ BUFSIZE ];
      int               bytesWritten         = 0;
      int []            unicharsRead         = new int [1];
      int               totalCharsCompressed = 0;
      int               totalBytesWritten    = 0;
    
      do {
        // do the compression
        bytesWritten = myCompressor.compress(chars, totalCharsCompressed, 
                                             len, unicharsRead,
                                             byteBuffer, 0, BUFSIZE);
    
        // do something with the current set of bytes
        out.write(byteBuffer, 0, bytesWritten);
    
        // update the no. of characters compressed
        totalCharsCompressed += unicharsRead[0];
    
        // update the no. of bytes written
        totalBytesWritten += bytesWritten;
    
      } while(totalCharsCompressed < len);
    
      myCompressor.reset(); // reuse compressor
     
    Author:
    Stephen F. Booth
    See Also:
    UnicodeDecompressor