Question about RLE and DELTA encoding

Hao Jiang
Dear Dev team,

I asked a question several days ago about RLE and DELTA encoding in
Carbon. Thank you for pointing me to the source code of the implementation.

I have read through the code and have the following understanding.
Could you please confirm whether it is correct? Thanks!

1. RLE encoding only applies to columns that have Encoding.DICTIONARY
enabled and whose cardinality is less than the parameter
CarbonCommonConstants.HIGH_CARDINALITY_VALUE.

I saw that the RLE encoding is applied to data in the function
BlockIndexerStorageForInt.compressDataMyOwnWay, and is controlled by
aggKeyBlock, whose value is set by arrangeUniqueBlockType.

If my understanding is correct, could you please share the reasons for
designing the logic this way?

2. DELTA encoding is implemented in
ValueCompressionUtil.getCompressedValues. It doesn't do a sequential
DELTA encoding, e.g., for a list of numbers a, b, c, ..., encoding them
as a, b-a, c-b, ... Instead, it does a max-delta encoding, e.g., for
a, b, c, ..., assuming the max value is M, it encodes them as M-a, M-b,
M-c.

Could you please also share the reasoning behind choosing this
encoding?
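
To make the two schemes in question 2 concrete, here is a minimal
Python sketch. It is not the Carbon code: the function names are
hypothetical and only illustrate the arithmetic described above.

```python
def sequential_delta_encode(values):
    """Encode a, b, c, ... as a, b-a, c-b, ... (first value, then differences)."""
    if not values:
        return []
    encoded = [values[0]]
    for prev, cur in zip(values, values[1:]):
        encoded.append(cur - prev)
    return encoded


def sequential_delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    out, running = [], 0
    for delta in encoded:
        running += delta
        out.append(running)
    return out


def max_delta_encode(values):
    """Return (M, [M-a, M-b, ...]) where M is the maximum; deltas are >= 0."""
    m = max(values)
    return m, [m - v for v in values]


def max_delta_decode(m, deltas):
    """Rebuild the original values from the maximum and the deltas."""
    return [m - d for d in deltas]


data = [10, 12, 11, 15]
print(sequential_delta_encode(data))  # [10, 2, -1, 4]
print(max_delta_encode(data))         # (15, [5, 3, 4, 0])
```

Note that the sequential form must be decoded in order, while each
max-delta value can be decoded on its own given M.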

Thanks!

Regards,

Hao Jiang


Re: Question about RLE and DELTA encoding

k.ashok
Hi Hao Jiang,
Regarding your first question, why RLE is controlled by aggKeyBlock:
Carbon has both dictionary and no-dictionary column types.
Carbon sorts the column data and then stores it. Because of the sorting, the row index gets shuffled.
Hence, for no-dictionary data, RLE is applied on the index and not on the data.
Thus, in BlockIndexerStorageForInt@compressMyOwnWay, RLE happens on the index; compressDataMyOwnWay
is applied only to dictionary data.
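
As a rough illustration of why sorting helps RLE on dictionary data, here is a minimal Python sketch.
It is not the Carbon implementation: the function names are hypothetical, and it only covers the
dictionary-data case (sort the surrogate keys, remember the row order, run-length encode the keys).

```python
def sort_with_row_ids(keys):
    """Sort keys and return (sorted_keys, row_ids), where row_ids[i] is the
    original row of sorted_keys[i]."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return [keys[i] for i in order], order


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


keys = [3, 1, 3, 2, 1, 3]          # dictionary surrogate keys, low cardinality
sorted_keys, row_ids = sort_with_row_ids(keys)
print(rle_encode(keys))            # unsorted: six runs of length 1
print(rle_encode(sorted_keys))     # sorted: [(1, 2), (2, 1), (3, 3)]
```

The row ids have to be kept alongside the runs so that the original row order can be restored.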

Regarding your second question:
Measure data is not sorted, and hence the sequential deltas may be either big or small.
For example, if the data is 2, -3, 4, -6, then the sequential encoding (first value, then
successive differences) is 2, -5, 7, -10.
Other than max/min delta, we also do type conversion to reduce storage space.
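
The combination of max delta and type conversion can be sketched in Python as below. This is only an
illustration under assumptions: the helper name and the 1/2/4/8-byte signed widths are hypothetical
and do not reproduce Carbon's actual type-selection logic.

```python
def smallest_signed_width(values):
    """Smallest of 1, 2, 4, 8 bytes whose signed integer range holds all values."""
    lo, hi = min(values), max(values)
    for width in (1, 2, 4, 8):
        bound = 1 << (8 * width - 1)
        if -bound <= lo and hi < bound:
            return width
    raise OverflowError("values do not fit in 8 bytes")


raw = [100000, 100050, 99990, 100100]   # unsorted measure values
m = max(raw)
deltas = [m - v for v in raw]           # max-delta values, all in [0, max - min]

print(deltas)                           # [100, 50, 110, 0]
print(smallest_signed_width(raw))       # 4
print(smallest_signed_width(deltas))    # 1
```

Because every max-delta value lies between 0 and (max - min), the width needed depends only on the
spread of the data, not on its order or its absolute magnitude.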