Apache CarbonData Dev Mailing List archive

Re: Discussion: change default compressor to ZSTD

Posted by Ajantha Bhat on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-change-default-compressor-to-ZSTD-tp91152p92078.html

Hi Jacky and Ravindra,

We tested ZSTD against snappy again with the latest code on a 3-node Spark
2.3 cluster on HDFS, using 500 GB of TPCH data.
The summary is below:

*1. ZSTD store is 28.8% smaller compared to snappy*
*2. Overall query time is degraded by 18.35% in ZSTD compared to snappy*
*3. Load time in ZSTD has a negligible degradation of 0.7% compared to
snappy*

Based on this, I think we cannot make ZSTD the default because of the
large (18.35%) degradation in query time.

Thanks,
Ajantha

On Fri, Feb 7, 2020 at 4:54 PM Ravindra Pesala <[hidden email]>
wrote:

> Hi Jacky,
>
> As per the original PR
> https://github.com/apache/carbondata/pull/2628 , query performance
> decreased by 20% ~ 50% compared to snappy, so I am concerned about the
> performance. Please prepare a proper TPCH performance report on the
> regular cluster, as we do for every version, and decide based on that.
>
> Regards,
> Ravindra.
>
> On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <[hidden email]> wrote:
>
> > Hi Ajantha,
> >
> >
> > Yes, the decoder will use the compressorName stored in
> > ChunkCompressionMeta from the file header,
> > but I think it is better to also put it in the file name, so that the
> > user can see the compressor from the shell without launching an engine
> > to read the file.
> >
> >
> > In Spark, the file name written for parquet/orc looks like this:
> > part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
> >
> >
> > In PR3606, I will handle the compatibility.
> >
> >
> > Regards,
> > Jacky
> >
> >
> > ------------------ Original Message ------------------
> > From: "Ajantha Bhat"<[hidden email]>;
> > Sent: Thursday, February 6, 2020, 11:51 PM
> > To: "dev"<[hidden email]>;
> >
> > Subject: Re: Discussion: change default compressor to ZSTD
> >
> >
> >
> > Hi,
> >
> > 33% is a huge reduction in store size. If there is a negligible
> > difference in load and query time, we should definitely go for it.
> >
> > Also, does the user really need to know which compression is used? A
> > change in the file name may require compatibility handling.
> > The thrift *FileHeader, ChunkCompressionMeta* already stores the
> > compressor name, so query-time decoding can be based on it.
> >
> > Thanks,
> > Ajantha
> >
> >
> > On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <[hidden email]> wrote:
> >
> > > Hi,
> > >
> > >
> > > I compared the snappy and zstd compressors using TPCH for carbondata.
> > >
> > >
> > > For TPCH lineitem table:
> > >              carbon-zstd  carbon-snappy
> > > loading (s)  53           51
> > > size         795MB        1.2GB
> > >
> > > TPCH-query:
> > >              carbon-zstd  carbon-snappy
> > > Q1           4.289        8.29
> > > Q2           12.609       12.986
> > > Q3           14.902       14.458
> > > Q4           6.276        5.954
> > > Q5           23.147       21.946
> > > Q6           1.12         0.945
> > > Q7           23.017       28.007
> > > Q8           14.554       15.077
> > > Q9           28.472       27.473
> > > Q10          24.067       24.682
> > > Q11          3.321        3.79
> > > Q12          5.311        5.185
> > > Q13          14.08        11.84
> > > Q14          2.262        2.087
> > > Q15          5.496        4.772
> > > Q16          29.919       29.833
> > > Q17          7.018        7.057
> > > Q18          17.367       17.795
> > > Q19          2.931        2.865
> > > Q20          11.347       10.937
> > > Q21          26.416       28.414
> > > Q22          5.923        6.311
> > > sum          283.844      290.704
> > >
> > >
> > > As you can see, with zstd the table size is reduced by 33% compared
> > > to snappy, and the difference in data loading and query time is
> > > negligible. So I suggest changing the default compressor in
> > > carbondata from snappy to zstd.
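
The percentages quoted above can be re-derived from the lineitem figures. The following is only a minimal sketch, assuming that 1.2GB is read as 1200 MB and that the first value in each row belongs to carbon-zstd and the second to carbon-snappy, as in the table header:

```python
# Figures copied from the quoted TPCH lineitem comparison.
ZSTD_SIZE_MB = 795.0
SNAPPY_SIZE_MB = 1200.0  # assumption: "1.2GB" taken as 1200 MB
ZSTD_LOAD_S, SNAPPY_LOAD_S = 53.0, 51.0

# query -> (carbon-zstd, carbon-snappy), in the order given in the mail
QUERY_TIMES = {
    "Q1": (4.289, 8.29), "Q2": (12.609, 12.986), "Q3": (14.902, 14.458),
    "Q4": (6.276, 5.954), "Q5": (23.147, 21.946), "Q6": (1.12, 0.945),
    "Q7": (23.017, 28.007), "Q8": (14.554, 15.077), "Q9": (28.472, 27.473),
    "Q10": (24.067, 24.682), "Q11": (3.321, 3.79), "Q12": (5.311, 5.185),
    "Q13": (14.08, 11.84), "Q14": (2.262, 2.087), "Q15": (5.496, 4.772),
    "Q16": (29.919, 29.833), "Q17": (7.018, 7.057), "Q18": (17.367, 17.795),
    "Q19": (2.931, 2.865), "Q20": (11.347, 10.937), "Q21": (26.416, 28.414),
    "Q22": (5.923, 6.311),
}


def pct_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


zstd_total = sum(z for z, _ in QUERY_TIMES.values())
snappy_total = sum(s for _, s in QUERY_TIMES.values())

size_change_pct = pct_change(ZSTD_SIZE_MB, SNAPPY_SIZE_MB)
load_change_pct = pct_change(ZSTD_LOAD_S, SNAPPY_LOAD_S)
query_change_pct = pct_change(zstd_total, snappy_total)

print(f"size:  {size_change_pct:+.2f}%")
print(f"load:  {load_change_pct:+.2f}%")
print(f"query: {query_change_pct:+.2f}% ({zstd_total:.3f} vs {snappy_total:.3f})")
```

Negative values mean zstd is smaller or faster than snappy. Note that Q1 alone accounts for 4.001 of the 6.86 difference between the two query totals.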
> > >
> > >
> > > To change the default compressor, we need to:
> > > 1. Append the compressor name to the carbondata file name, so that
> > > the user can tell from the file name which compressor was used.
> > > For example, the file name will change from
> > > part-0-0_batchno0-0-0-1580982686749.carbondata
> > > to part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> > > or part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
> > >
> > >
> > > 2. Change the compressor constant in CarbonCommonConstants.java
> > > to use zstd as the default compressor.
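
To make the compatibility concern around step 1 concrete, here is a minimal sketch of how a reader could take the compressor from the proposed file name while still accepting old names. It is not CarbonData code: the function name and the set of known compressors are hypothetical, and for a legacy name it returns None on the assumption that the caller then falls back to the compressor name stored in the file header.

```python
# Hypothetical set of suffixes; only snappy and zstd appear in this thread.
KNOWN_COMPRESSORS = {"snappy", "zstd"}
EXTENSION = ".carbondata"


def compressor_from_file_name(file_name):
    """Return the compressor encoded in the file name, or None for legacy names.

    None means the name carries no compressor suffix, so the caller is
    assumed to read the compressor name from the file header instead.
    """
    if not file_name.endswith(EXTENSION):
        raise ValueError(f"not a carbondata file: {file_name!r}")
    stem = file_name[: -len(EXTENSION)]
    # rpartition returns ("", "", stem) when the stem contains no dot.
    _, dot, suffix = stem.rpartition(".")
    if dot and suffix in KNOWN_COMPRESSORS:
        return suffix
    return None


for name in (
    "part-0-0_batchno0-0-0-1580982686749.carbondata",
    "part-0-0_batchno0-0-0-1580982686749.snappy.carbondata",
    "part-0-0_batchno0-0-0-1580982686749.zstd.carbondata",
):
    print(name, "->", compressor_from_file_name(name))
```

Treating the suffix as optional keeps files written before the rename readable, which is the compatibility point raised earlier in the thread.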
> > >
> > >
> > > What do you think?
> > >
> > >
> > > Regards,
> > > Jacky
>
> --
> Thanks & Regards,
> Ravi
>