
Re: [1.5.2] Gzip Compression Support

Posted by manhua on Oct 25, 2018; 3:49am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/1-5-2-Gzip-Compression-Support-tp64960p66121.html

Regarding column compression in general, I have run into a problem with
onHeap/offHeap compression in carbon.
Snappy behaves differently from zstd, and I wonder whether the same problem
exists for gzip.
Also, how can we unify the processing across different compressors?

## Problem ##
Recently I got stuck on a problem while looking at unsafe zstd compression in
carbon.

Since the zstd-jni 1.3.6-1 release supports
Zstd.compressUnsafe(outputAddress, outputSize, inputAddress, inputSize,
COMPRESS_LEVEL), we can enable unsafe zstd compression in
UnsafeFixLengthColumnPage.java.

However, the query result is wrong for columns that use
UnsafeFixLengthColumnPage.


## Analysis ##
I found that the root cause is byte order (LITTLE_ENDIAN vs. BIG_ENDIAN).

For onHeap/safe loading, the zstd compressor in carbon always works on byte[],
converting from/to the different datatypes.   --- This case is fine.
For offheap/unsafe loading, UnsafeFixLengthColumnPage calls
CarbonUnsafe.getUnsafe() to put values into memory (e.g. putShort, putInt,
putDouble...) and then does the rawCompress.    --- The key point here is that
unsafe.putXXX depends on the native byte order.

Take a simplified example for zstd in carbon (each digit below stands for one
byte):
Input: int[] {1,2}
onheap:  convert to byte[] 00010002 -> compressByte
offheap: putInt 10002000 -> rawCompress

decompression: unCompressByte -> convert to int[]
       
onheap/offheap only affects the compression path; carbon uses the same code to
decompress, so for the example above the decompressed results differ.
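
The mismatch can be reproduced with plain JDK classes. The following is only a
minimal sketch, not carbon's code: a ByteBuffer with an explicit LITTLE_ENDIAN
order stands in for what unsafe.putInt would write on a little-endian machine,
and the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndianMismatchDemo {
  // On-heap path (sketch): ints are converted to byte[] in big-endian order,
  // which is ByteBuffer's default.
  static byte[] onHeapBytes(int[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * 4);
    for (int v : values) {
      buf.putInt(v);
    }
    return buf.array();
  }

  // Off-heap path (sketch). Assumption: LITTLE_ENDIAN simulates the layout
  // that unsafe.putInt produces on a little-endian machine.
  static byte[] offHeapBytes(int[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
    for (int v : values) {
      buf.putInt(v);
    }
    return buf.array();
  }

  // Shared decode path: always reads the bytes back as big-endian ints.
  static int[] decode(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int[] out = new int[bytes.length / 4];
    for (int i = 0; i < out.length; i++) {
      out[i] = buf.getInt();
    }
    return out;
  }

  public static void main(String[] args) {
    int[] input = {1, 2};
    System.out.println(Arrays.toString(decode(onHeapBytes(input))));   // [1, 2]
    System.out.println(Arrays.toString(decode(offHeapBytes(input))));  // [16777216, 33554432]
  }
}
```

The compressor itself is left out on purpose: it only sees bytes, so the
difference comes entirely from how the ints were laid out before compression.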

## What about Snappy ##
So why can snappy deal with unsafe perfectly?
I'm not familiar with JNI coding, but after a glance at it, I think the
difference is that snappy offers typed APIs such as compress(int[] var0) and
uncompressIntArray(byte[] var0), and its implementation uses
GetDirectBufferAddress (is this endian related?).

Applying the same example to snappy in carbon:
Input: int[] {1,2}
onheap:  compressInt 10002000
offheap: putInt 10002000

decompression: uncompressIntArray
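
One way to read this: as long as the writer and the reader agree on the byte
order, the layout does not have to match Java's big-endian default. The sketch
below illustrates that idea only. It does not call snappy-java, LITTLE_ENDIAN is
an assumption standing in for the native order, and the names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ConsistentOrderDemo {
  // Lay out ints as bytes using the given byte order.
  static byte[] toBytes(int[] values, ByteOrder order) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * 4).order(order);
    buf.asIntBuffer().put(values);
    return buf.array();
  }

  // Read ints back from bytes using the given byte order.
  static int[] toInts(byte[] bytes, ByteOrder order) {
    int[] out = new int[bytes.length / 4];
    ByteBuffer.wrap(bytes).order(order).asIntBuffer().get(out);
    return out;
  }

  public static void main(String[] args) {
    int[] input = {1, 2};
    // Assumption: LITTLE_ENDIAN plays the role of the machine's native order.
    byte[] raw = toBytes(input, ByteOrder.LITTLE_ENDIAN);
    // Same order on both sides: values survive.
    System.out.println(Arrays.toString(toInts(raw, ByteOrder.LITTLE_ENDIAN)));  // [1, 2]
    // Different order on the read side: values are byte-swapped.
    System.out.println(Arrays.toString(toInts(raw, ByteOrder.BIG_ENDIAN)));     // [16777216, 33554432]
  }
}
```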

## Simple code to check snappy ##
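```
import java.nio.ByteBuffer;
import org.xerial.snappy.Snappy;

public class SnappyEndianCheck {
  public static void main(String[] args) throws Exception {
    int[] inData = new int[1];
    inData[0] = 1;
    byte[] safe_out = Snappy.compress(inData);

    // uncompress to raw bytes, then read them as a big-endian int
    // (ByteBuffer's default order)
    byte[] check1 = Snappy.uncompress(safe_out);
    System.out.println(ByteBuffer.wrap(check1).getInt(0));  // 16777216 on my machine

    // uncompressIntArray
    int[] check2 = Snappy.uncompressIntArray(safe_out);
    System.out.println(check2[0]);  // 1
  }
}
```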

## Note ##
My machine's native byte order (ByteOrder.nativeOrder()) is LITTLE_ENDIAN, while
Java's default (e.g. for ByteBuffer) is BIG_ENDIAN.
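
For anyone who wants to check their own environment, a small sketch using only
JDK classes (the class name is made up). It prints the native order and
ByteBuffer's default, and shows that the 16777216 seen earlier is simply 1 with
its bytes reversed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderCheck {
  // Byte order a freshly allocated ByteBuffer uses unless told otherwise.
  static ByteOrder javaDefault() {
    return ByteBuffer.allocate(4).order();
  }

  public static void main(String[] args) {
    // Depends on the machine, so no expected value is given here.
    System.out.println("native order:       " + ByteOrder.nativeOrder());
    System.out.println("ByteBuffer default: " + javaDefault());            // BIG_ENDIAN
    System.out.println("reverseBytes(1):    " + Integer.reverseBytes(1));  // 16777216
  }
}
```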





-----
Regards
Manhua