
Support Zstd as Column Compressor

Posted by xuchuanyin on Aug 27, 2018; 3:47am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Support-Zstd-as-Column-Compressor-tp60417.html

Hi all,
Here I would like to explain the changes for 'Support Zstd as Column
Compressor' (PR2628). Please share your feedback if you have any concerns.

# BACKGROUND

Zstd is a compressor that achieves a higher compression ratio than Snappy
while offering similar compression/decompression speed (slightly worse than
Snappy). This compressor has been used in other products in our company and
is regarded as a replacement for Snappy that gives a higher compression ratio
with an acceptable decrease in decompression speed.
So we want to introduce the Zstd compressor to compress the column values in
the final carbondata file. (The last sentence is meant to distinguish it from
the compressor for the sort temp files.)

# DESIGN&MODIFICATIONS

1. The compressor used for a column is recorded in the metadata of
DataChunk3. CarbonData defines the compressor in thrift. Previously only
Snappy was supported, so we
 1.1 add Zstd to the thrift definition
 1.2 add ZstdCompressor and update the CompressorFactory (see the sketch
after this point)
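
To make 1.2 concrete, here is a minimal sketch of a Zstd-backed compressor
built on the zstd-jni library. The method names (compressByte,
unCompressByte, getName) and the compression level are only illustrative
assumptions; the actual Compressor interface in CarbonData may look
different.

import com.github.luben.zstd.Zstd;

public class ZstdCompressorSketch {
  // level 3 is zstd's default; the real implementation may choose otherwise
  private static final int COMPRESS_LEVEL = 3;

  public byte[] compressByte(byte[] unCompInput) {
    // zstd records the uncompressed size in its frame header,
    // so nothing extra needs to be stored alongside the output
    return Zstd.compress(unCompInput, COMPRESS_LEVEL);
  }

  public byte[] unCompressByte(byte[] compInput) {
    long originalSize = Zstd.decompressedSize(compInput);
    return Zstd.decompress(compInput, (int) originalSize);
  }

  public String getName() {
    return "zstd";
  }
}
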
2. For data loading, before the load starts, CarbonData gets the compressor
from the system property file and passes the compressor info to the
subsequent procedures, so that all the pages in all the blocklets of this
load use the same compressor. This avoids problems when the property is
changed while loads are running concurrently.
For this modification, we will (see the sketch after this point)
2.1 add the compressor info to CarbonLoadModel and
CarbonFactDataHandlerModel
2.2 add the compressor as a member of ColumnPage
2.3 add the compressor as an input parameter when creating a ColumnPage
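
Below is a simplified, self-contained sketch of the idea in 2.1-2.3: resolve
the compressor name once before the load starts and carry it through the
models into every ColumnPage, so a concurrent property change cannot affect
an in-flight load. The property key and the stub classes are hypothetical
placeholders, not the actual CarbonData API.

public class LoadCompressorResolutionSketch {

  // stand-ins for CarbonLoadModel / ColumnPage, just to show the data flow
  static class LoadModelStub {
    String columnCompressor;
  }

  static class ColumnPageStub {
    final String compressor;
    ColumnPageStub(String compressor) {
      this.compressor = compressor;
    }
  }

  public static void main(String[] args) {
    // resolve the compressor name once, before the load starts
    // ("carbon.column.compressor" is an assumed property key)
    String compressorName = System.getProperty("carbon.column.compressor", "snappy");

    // carry it in the load model so all downstream steps see the same value
    LoadModelStub loadModel = new LoadModelStub();
    loadModel.columnCompressor = compressorName;

    // every page created during this load takes the compressor from the model
    ColumnPageStub page = new ColumnPageStub(loadModel.columnCompressor);
    System.out.println("page compressor: " + page.compressor);
  }
}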

3. For data querying, CarbonData gets the compressor info from the DataChunk3
of the chunk and then uses that compressor to decompress the content. This
means that we will
3.1 get the compressor from the dimension/measure chunk during reading (see
the sketch after this point)
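
As a rough illustration of 3.1, the reader picks the decompressor from the
compressor name recorded in the chunk metadata rather than from the current
system property. The switch below is only a sketch; the real code would go
through the CompressorFactory, and the method names are assumptions.

import java.io.IOException;
import com.github.luben.zstd.Zstd;
import org.xerial.snappy.Snappy;

public class ChunkDecompressSketch {
  // compressorName comes from the DataChunk3 metadata of the chunk being read
  public static byte[] decompress(String compressorName, byte[] compressed)
      throws IOException {
    switch (compressorName) {
      case "zstd":
        return Zstd.decompress(compressed, (int) Zstd.decompressedSize(compressed));
      case "snappy":
        return Snappy.uncompress(compressed);
      default:
        throw new IllegalArgumentException("Unsupported compressor: " + compressorName);
    }
  }
}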

4. For other places that use a compressor, such as compressing the
configuration, we keep using Snappy just like before. This means we will
4.1 explicitly specify Snappy as the compressor there

5. The legacy store uses Snappy, so we just
5.1 specify Snappy as the compressor while reading the legacy store.

6. For streaming segments, the (streaming) blocklets are also compressed.
Because files in a streaming segment did not store the compressor info
before, we (see the sketch after this point)
6.1 add the compressor to the FileHeader in the thrift file
6.2 during loading into a streaming segment, if the stream file already
exists, read the compressor info from the FileHeader of that file and reuse
that compressor
6.3 if the stream file does not exist, read the compressor info from the
system property and set it in the FileHeader
6.4 a streaming legacy store does not have the compressor in its FileHeader;
in this case, we use Snappy to write & read the subsequent streaming
blocklets
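
The streaming rules in 6.2-6.4 boil down to a small decision, sketched below
with illustrative names (the real code works on the thrift FileHeader
object):

public class StreamingCompressorChoiceSketch {

  static String chooseCompressor(boolean streamFileExists, String headerCompressor,
      String propertyCompressor) {
    if (streamFileExists) {
      // 6.4: legacy stream file, FileHeader carries no compressor -> snappy
      if (headerCompressor == null || headerCompressor.isEmpty()) {
        return "snappy";
      }
      // 6.2: reuse the compressor already recorded in the FileHeader
      return headerCompressor;
    }
    // 6.3: new stream file, take the compressor from the system property
    return propertyCompressor;
  }

  public static void main(String[] args) {
    System.out.println(chooseCompressor(true, null, "zstd"));     // snappy (legacy)
    System.out.println(chooseCompressor(true, "zstd", "snappy")); // zstd (reuse header)
    System.out.println(chooseCompressor(false, null, "zstd"));    // zstd (from property)
  }
}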

7. For compaction and handoff, since they reuse the read procedure, no extra
modification is needed. We still
7.1 add test cases for them; please refer to
'TestLoadDataWithCompression.scala'.

8. Extending to other compressors is simple. Take LZ4 for example; the
following changes are required (see the sketch after this point):
8.1 Add LZ4 to the thrift definition
8.2 Add an Lz4Compressor
8.3 Register the Lz4Compressor in the compressor factory
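
For 8.2, an LZ4-backed compressor could look roughly like the sketch below,
built on the lz4-java library. Because the LZ4 block format does not record
the uncompressed size, this sketch prepends it as a 4-byte header; the method
names and this framing are assumptions for illustration only.

import java.nio.ByteBuffer;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class Lz4CompressorSketch {
  private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

  public byte[] compressByte(byte[] unCompInput) {
    byte[] compressed = FACTORY.fastCompressor().compress(unCompInput);
    // store the original length so decompression knows the output size
    return ByteBuffer.allocate(4 + compressed.length)
        .putInt(unCompInput.length)
        .put(compressed)
        .array();
  }

  public byte[] unCompressByte(byte[] compInput) {
    ByteBuffer buffer = ByteBuffer.wrap(compInput);
    int originalLength = buffer.getInt();
    byte[] body = new byte[compInput.length - 4];
    buffer.get(body);
    LZ4FastDecompressor decompressor = FACTORY.fastDecompressor();
    return decompressor.decompress(body, originalLength);
  }

  public String getName() {
    return "lz4";
  }
}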



