Hi community,
Currently carbon supports the SNAPPY and ZSTD codecs. I propose adding Gzip as a compression codec offered by carbon. Some benefits of having a Gzip compression codec are:

1. Gzip offers reduced file size compared to other codecs like snappy, but at the cost of processing speed.
2. Gzip is suitable for users who have cold data, i.e. data which is stored permanently and will be queried rarely.

I have created a jira issue for this: https://issues.apache.org/jira/browse/CARBONDATA-3005 and will add the design document there. Any suggestions from the community are welcome.

Regards,
Shardul
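For illustration, a minimal sketch of what the compress/uncompress pair could look like using the JDK's built-in java.util.zip streams. The class and method names here are hypothetical, chosen only to mirror the byte-oriented style of carbon's existing compressor interface; they are not the proposed implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSketch {
    // Gzip a byte[] using the JDK's GZIPOutputStream
    static byte[] compressByte(byte[] input) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(input);
        }
        return bos.toByteArray();
    }

    // Inflate back; note the target size need not be known up front,
    // the stream is drained into a growing buffer
    static byte[] unCompressByte(byte[] input) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                new GZIPInputStream(new ByteArrayInputStream(input))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzip.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "cold data stored permanently".getBytes("UTF-8");
        byte[] restored = unCompressByte(compressByte(data));
        System.out.println(new String(restored, "UTF-8"));
    }
}
```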
+1
I have some questions:

1. Other than uncompressByteArray, does Gzip offer uncompressShortArray and uncompressIntArray?
2. Does Gzip need the uncompressed size to allocate the target array before uncompressing?
3. Does your solution require a data copy?

Regards,
Jacky
Hi,
1. No, it doesn't support uncompressShort/Int. A Short/Int array needs to be typecast to a byte array and then passed for compression. For uncompression we get the result as a byte array that needs to be typecast back to a Short/Int array depending on the requirement.
2. No, it doesn't need the uncompressed size.
3. Yes, a data copy is required during uncompression, to avoid the compressed data getting modified. It is also required if the offset of the data is not 0.

Regards,
Shardul
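The typecast described in point 1 can be sketched with java.nio.ByteBuffer, which packs primitives into a byte[] with an explicit byte order. This is only one possible way to do the conversion, not necessarily how the PR implements it:

```java
import java.nio.ByteBuffer;

public class TypecastSketch {
    // Pack an int[] into a byte[] before handing it to the compressor;
    // ByteBuffer defaults to big-endian, so the order is deterministic
    static byte[] intsToBytes(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * 4);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Reverse conversion applied to the uncompressed byte[]
    static int[] bytesToInts(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] out = new int[bytes.length / 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] original = {1, 2, 3};
        int[] roundTrip = bytesToInts(intsToBytes(original));
        System.out.println(roundTrip[0] + "," + roundTrip[1] + "," + roundTrip[2]);
    }
}
```

Because the same ByteBuffer order is used on both sides, the round trip is lossless regardless of the platform's native endianness.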
Comment inline
> 1. No it doesn't support UncompressShort/Int, Short/Int array needs to be
> typecasted to byte array and then passed for compression.

In PR2728, xuchuanyin modified the compress/uncompress interface to keep only compressByte, and modified the ColumnPage to use ByteArray instead of primitive data arrays. If this helps you simplify the GZip PR, we should work on PR2728 and merge it. What do you think?

> 3. Yes data copy is required during uncompression to avoid compressed data
> getting modified. Also required if the offset of the data is not 0.

Please check whether Gzip offers an uncompression method that accepts a ByteBuffer; maybe we can move the position of the ByteBuffer and Gzip can uncompress starting from the position we give? I remember ZSTD supports something like this.
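On the offset question: with the JDK's gzip streams, one way to avoid copying when the compressed data sits at a non-zero offset is the ByteArrayInputStream(buf, off, len) constructor, which reads the region in place. This is a sketch under that assumption, not a claim about what the GZip PR does:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class OffsetSketch {
    public static void main(String[] args) throws Exception {
        // Build a gzip payload to work with
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("payload".getBytes("UTF-8"));
        }
        byte[] compressed = bos.toByteArray();

        // Simulate compressed data at a non-zero offset in a larger page
        byte[] page = new byte[compressed.length + 8];
        System.arraycopy(compressed, 0, page, 8, compressed.length);

        // ByteArrayInputStream(buf, off, len) reads the slice in place,
        // so no copy of the page is needed before uncompressing
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(page, 8, compressed.length))) {
            byte[] tmp = new byte[256];
            int n;
            while ((n = in.read(tmp)) > 0) {
                out.write(tmp, 0, n);
            }
        }
        System.out.println(out.toString("UTF-8"));
    }
}
```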
For all column compression, I have a problem with onHeap/offHeap compression in carbon. Snappy works differently from zstd. I wonder whether this problem exists for gz or not, and how we can unify the processing of the different compressors.

## Problem ##

Recently I got trapped in a problem when looking at zstd unsafe compression in carbon. Since the zstd-jni 1.3.6-1 release supports Zstd.compressUnsafe(outputAddress, outputSize, inputAddress, inputSize, COMPRESS_LEVEL), we can enable zstd unsafe compression in UnsafeFixLengthColumnPage.java. However, the query result is wrong for columns that use UnsafeFixLengthColumnPage.

## Analysis ##

I found the root cause is LITTLE_ENDIAN/BIG_ENDIAN.

For onHeap/safe loading, the zstd compressor in carbon always uses byte[], converting from/to the different datatypes. --- This case is fine.

For offHeap/unsafe loading, UnsafeFixLengthColumnPage calls CarbonUnsafe.getUnsafe() to put values into memory (e.g. putShort, putInt, putDouble...) and then does the rawCompress. --- The key point here is that unsafe.putXXX depends on the platform's endianness.

Take a simplified example for zstd in carbon, Input: int[] {1,2}

onHeap: convert to byte[] 00010002 -> compressByte
offHeap: putInt 10002000 -> rawCompress
decompression: unCompressByte -> convert to int[]

onHeap/offHeap only affects the compress process; carbon uses the same code to decompress, so for the above example the decompressed results differ.

## What about Snappy ##

So, why can snappy deal with unsafe perfectly? I'm not familiar with the JNI coding, but after a glance at it, I think the gap is that snappy offers APIs per datatype, like compress(int[] var0) and uncompressIntArray(byte[] var0), and its implementation uses GetDirectBufferAddress (is this endian related?).

Applying the same example to snappy in carbon, Input: int[] {1,2}

onHeap: compressInt 10002000
offHeap: putInt 10002000
decompression: uncompressIntArray

## Simple code to check snappy ##

```
int[] inData = new int[1];
inData[0] = 1;
byte[] safe_out = Snappy.compress(inData);

// uncompress
byte[] check1 = Snappy.uncompress(safe_out);
System.out.println(ByteBuffer.wrap(check1).getInt(0)); // 16777216

// uncompressIntArray
int[] check2 = Snappy.uncompressIntArray(safe_out);
System.out.println(check2[0]); // 1
```

## Note ##

My native order is LITTLE_ENDIAN and the Java default is BIG_ENDIAN.
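The endian mismatch described above can be reproduced without Unsafe or any compressor, using only ByteBuffer byte orders. This sketch assumes a little-endian machine, as the note above does; Unsafe.putInt writes in native order, which the LITTLE_ENDIAN buffer stands in for here:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianSketch {
    public static void main(String[] args) {
        int value = 1;

        // onHeap path: ByteBuffer defaults to BIG_ENDIAN -> bytes 00 00 00 01
        byte[] big = ByteBuffer.allocate(4).putInt(value).array();

        // unsafe path: native order, LITTLE_ENDIAN on most machines
        // -> bytes 01 00 00 00
        byte[] little = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();

        System.out.println(big[0] + " " + big[3]);       // 0 1
        System.out.println(little[0] + " " + little[3]); // 1 0

        // Reading the little-endian bytes back with a big-endian getInt
        // yields 0x01000000 = 16777216, the wrong-result symptom above
        System.out.println(ByteBuffer.wrap(little).getInt());
    }
}
```

This matches the snappy check in the thread: the raw byte[] uncompress read as big-endian prints 16777216, while the type-aware uncompressIntArray restores 1.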
Regards,
Manhua