Hi all,
Here I am to explain the modifications in 'Support Zstd as Column Compressor' (PR2628). Please give your feedback if you have any problems.

# BACKGROUND

Zstd is a compressor with a higher compression ratio than Snappy and similar compression/decompression speed (decompression is slightly slower than Snappy). This compressor has been used in other products in our company and is regarded as a replacement for Snappy that trades an acceptable decrease in decompression speed for a higher compression ratio. So we want to introduce the Zstd compressor to compress the column values in the final carbondata file. (The last sentence is meant to distinguish it from the compressor for sort temp files.)

# DESIGN & MODIFICATIONS

1. The metadata of the compressor for a column is stored in DataChunk3. CarbonData defines the compressor in thrift. Previously it only supported Snappy, so I
1.1 add Zstd in the thrift;
1.2 add ZstdCompressor and update the CompressorFactory (see the sketch at the end of this mail).

2. For data loading, before the load starts, CarbonData will get the compressor from the system property file and pass the compressor info on to the subsequent procedures, so that all the pages in all the blocklets in this load use the same compressor. This avoids problems if the property is changed concurrently. For this modification, we will
2.1 add the compressor info in CarbonLoadModel and CarbonFactDataHandlerModel;
2.2 add the compressor as a member of ColumnPage;
2.3 add the compressor as an input parameter when creating a ColumnPage.

3. For data querying, CarbonData will get the compressor info from DataChunk3 in the chunk, then use that compressor to decompress the content. This means we will
3.1 get the compressor from the dimension/measure chunk during reading.

4. For other uses of compression, such as compressing the configuration, we will use Snappy just as before. This means we will
4.1 explicitly specify Snappy as the compressor for it.

5. The legacy store uses Snappy, so we just
5.1 specify Snappy as the compressor while reading the legacy store.

6. A streaming segment also compresses its (streaming) blocklets. Because files in a streaming segment did not store the compressor info before, we
6.1 add the compressor in the FileHeader in the thrift file;
6.2 during loading for a streaming segment, if the stream file already exists, read the compressor info from the FileHeader of that file and reuse that compressor;
6.3 if the stream file does not exist, read the compressor info from the system property and set it in the FileHeader;
6.4 for a streaming legacy store, which does not have the compressor in the FileHeader, use Snappy to write & read the following streaming blocklets.

7. Compaction and handoff reuse the read procedure, so no extra modification has been made for them. We still
7.1 add test cases for them; please refer to 'TestLoadDataWithCompression.scala'.

8. It is simple to extend this to other compressors. Take LZ4 for example; the following changes are required:
8.1 Add LZ4 in thrift
8.2 Add Lz4Compressor
8.3 Add Lz4Compressor to the compressor factory
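For reference, here is a minimal sketch of the Zstd wiring from step 1.2, assuming the zstd-jni library. The class and method names below are illustrative, not the actual CarbonData Compressor interface (which has per-datatype variants); the point is the underlying zstd-jni calls:

import com.github.luben.zstd.Zstd;

public final class ZstdCompressorSketch {
  // zstd's default compression level; the PR may choose a different one
  private static final int COMPRESS_LEVEL = 3;

  public byte[] compressByte(byte[] unCompInput) {
    return Zstd.compress(unCompInput, COMPRESS_LEVEL);
  }

  public byte[] unCompressByte(byte[] compInput) {
    // a zstd frame records the original size, so no external length is needed
    long decompressedSize = Zstd.decompressedSize(compInput);
    return Zstd.decompress(compInput, (int) decompressedSize);
  }
}

The factory part of step 1.2 then amounts to mapping the name 'zstd' to this implementation in CompressorFactory, next to the existing Snappy entry.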
As a result of the latest implementation, I store the compressor name in the thrift file, and the old enum for compression_codec has been deprecated. This makes it easier to support other compressors. Take LZ4 for example; the following changes are required:
1 Implement Lz4Compressor
2 Add Lz4Compressor to the compressor factory as a natively supported compressor
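Roughly, the thrift change looks like the sketch below. This is illustrative (field ids and comments are from memory, not copied from format.thrift); the point is that the deprecated enum stays so old stores remain readable, while new stores carry the compressor name as a string:

struct ChunkCompressionMeta {
    // deprecated: kept only for compatibility with old stores
    1: optional CompressionCodec compression_codec;
    // total byte size of all uncompressed pages in this column chunk
    2: required i64 total_uncompressed_size;
    // total byte size of all compressed pages in this column chunk
    3: required i64 total_compressed_size;
    // name of the compressor, e.g. 'snappy' or 'zstd'; new readers use this
    4: optional string compressor_name;
}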
Great, thanks for your effort.
For the LZ4 task, I checked the LZ4 compressor (lz4-java) and found it needs the decompressed size before it can decompress the data. In the CarbonData V3 format, we have stored the uncompressed size of the data page in ChunkCompressionMeta.total_uncompressed_size in the data file for every page. So to implement Lz4Compressor in carbon, I think we need to use this information from the file, and the compressor interface may need to be changed to pass this parameter to the unCompressXXX methods so that LZ4 can use it.

Regards,
Jacky
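To illustrate the constraint: with lz4-java, the fast decompressor must be told the destination length up front (a hedged sketch; in carbon that length would come from total_uncompressed_size):

import net.jpountz.lz4.LZ4Factory;

public final class Lz4SizeDemo {
  public static byte[] roundTrip(byte[] original) {
    LZ4Factory factory = LZ4Factory.fastestInstance();
    byte[] compressed = factory.fastCompressor().compress(original);

    // Unlike Snappy/Zstd, LZ4 cannot derive the original length from
    // the compressed bytes; the caller has to supply it.
    byte[] restored = new byte[original.length];
    factory.fastDecompressor().decompress(compressed, 0, restored, 0, original.length);
    return restored;
  }
}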
Yeah. Zstd and Snappy know the decompressed size from the compressed data, but LZ4 doesn't. I found a link that describes this: https://github.com/lz4/lz4-java/issues/26

To work around this with LZ4, we could go with your proposal and save & use the decompressed size from the meta. But I'd like to wrap the LZ4 implementation instead, by
1. adding the original size when we return the compressed content;
2. extracting the original size when we want to decompress the content.
In this way, we can keep the API stable (see the sketch below).
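A minimal sketch of that wrapping, assuming lz4-java; the class name is hypothetical, and a real implementation would reuse carbon's buffer handling:

import java.nio.ByteBuffer;
import net.jpountz.lz4.LZ4Factory;

public final class Lz4WithLengthPrefix {
  private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

  // Prefix the compressed bytes with the original length (4 bytes),
  // so the file metadata and the Compressor API stay unchanged.
  public static byte[] compress(byte[] input) {
    byte[] compressed = FACTORY.fastCompressor().compress(input);
    return ByteBuffer.allocate(4 + compressed.length)
        .putInt(input.length)
        .put(compressed)
        .array();
  }

  // Read the length prefix back, size the destination buffer from it,
  // then decompress the remainder.
  public static byte[] decompress(byte[] input) {
    int originalLength = ByteBuffer.wrap(input).getInt();
    byte[] restored = new byte[originalLength];
    FACTORY.fastDecompressor().decompress(input, 4, restored, 0, originalLength);
    return restored;
  }
}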