Apache CarbonData Dev Mailing List archive

Re: Discussion: change default compressor to ZSTD

Posted by ravipesala on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-change-default-compressor-to-ZSTD-tp91152p91230.html

Hi Jacky,

As per the original PR
https://github.com/apache/carbondata/pull/2628 , query performance with
zstd decreased by 20% to 50% compared to snappy, so I am concerned about
performance. It would be better to produce a proper TPCH performance report
on the regular cluster, as we do for every version, and decide based on that.

Regards,
Ravindra.

On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <[hidden email]> wrote:

> Hi Ajantha,
>
>
> Yes, the decoder will use the compressorName stored in ChunkCompressionMeta
> from the file header,
> but I think it is better to also put it in the file name, so that the user
> can see the compressor from the shell without launching an engine to read
> the file.
>
>
> In spark, for parquet/orc the file name written
> is: part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc
>
>
> In PR3606, I will handle the compatibility.
>
>
> Regards,
> Jacky
>
>
> ------------------ Original Message ------------------
> From: "Ajantha Bhat"<[hidden email]>;
> Sent: Thursday, February 6, 2020, 11:51 PM
> To: "dev"<[hidden email]>;
>
> Subject: Re: Discussion: change default compressor to ZSTD
>
>
>
> Hi,
>
> A 33% reduction in store size is huge. If the difference in load and query
> time is negligible, we should definitely go for it.
>
> But does the user really need to know which compressor is used? A change
> in the file name may need compatibility handling.
> The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor
> name, so query-time decoding can be based on that.
>
> Thanks,
> Ajantha
>
>
> On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <[hidden email]> wrote:
>
> > Hi,
> >
> >
> > I compared the snappy and zstd compressors using TPCH for carbondata.
> >
> >
> > For TPCH lineitem table:
> >                carbon-zstd    carbon-snappy
> > loading (s)    53             51
> > size           795MB          1.2GB
> >
> > TPCH-query:
> >                carbon-zstd    carbon-snappy
> > Q1             4.289          8.29
> > Q2             12.609         12.986
> > Q3             14.902         14.458
> > Q4             6.276          5.954
> > Q5             23.147         21.946
> > Q6             1.12           0.945
> > Q7             23.017         28.007
> > Q8             14.554         15.077
> > Q9             28.472         27.473
> > Q10            24.067         24.682
> > Q11            3.321          3.79
> > Q12            5.311          5.185
> > Q13            14.08          11.84
> > Q14            2.262          2.087
> > Q15            5.496          4.772
> > Q16            29.919         29.833
> > Q17            7.018          7.057
> > Q18            17.367         17.795
> > Q19            2.931          2.865
> > Q20            11.347         10.937
> > Q21            26.416         28.414
> > Q22            5.923          6.311
> > sum            283.844        290.704
> >
> >
> > As you can see, with zstd the table size is 33% smaller than with
> > snappy, and the difference in data loading and query time is negligible.
> > So I suggest changing the default compressor in carbondata from snappy
> > to zstd.
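
The percentages quoted above can be re-derived from the table. The following is a minimal sketch of that arithmetic, not part of the original benchmark: the timings are copied from the TPCH-query table, 1.2GB is assumed to mean 1200MB, and the helper name `relative_change` is made up for illustration.

```python
# (carbon-zstd, carbon-snappy) timings copied from the TPCH-query table above.
QUERY_TIMES = {
    "Q1": (4.289, 8.29),     "Q2": (12.609, 12.986),  "Q3": (14.902, 14.458),
    "Q4": (6.276, 5.954),    "Q5": (23.147, 21.946),  "Q6": (1.12, 0.945),
    "Q7": (23.017, 28.007),  "Q8": (14.554, 15.077),  "Q9": (28.472, 27.473),
    "Q10": (24.067, 24.682), "Q11": (3.321, 3.79),    "Q12": (5.311, 5.185),
    "Q13": (14.08, 11.84),   "Q14": (2.262, 2.087),   "Q15": (5.496, 4.772),
    "Q16": (29.919, 29.833), "Q17": (7.018, 7.057),   "Q18": (17.367, 17.795),
    "Q19": (2.931, 2.865),   "Q20": (11.347, 10.937), "Q21": (26.416, 28.414),
    "Q22": (5.923, 6.311),
}


def relative_change(new, old):
    """Return (new - old) / old; a negative value means `new` is smaller."""
    return (new - old) / old


zstd_total = sum(zstd for zstd, _ in QUERY_TIMES.values())
snappy_total = sum(snappy for _, snappy in QUERY_TIMES.values())

# Assumption: 1.2GB is read as 1200MB; the mail does not say which unit
# convention was used, so the size figure is approximate.
size_change = relative_change(795.0, 1200.0)
load_change = relative_change(53.0, 51.0)
total_change = relative_change(zstd_total, snappy_total)

# Queries where zstd was slower than snappy, largest relative slowdown first.
regressions = sorted(
    ((name, relative_change(zstd, snappy))
     for name, (zstd, snappy) in QUERY_TIMES.items() if zstd > snappy),
    key=lambda item: item[1],
    reverse=True,
)

if __name__ == "__main__":
    print(f"size change:       {size_change:+.1%}")
    print(f"load time change:  {load_change:+.1%}")
    print(f"query total:       {total_change:+.1%}")
    for name, change in regressions:
        print(f"  {name}: {change:+.1%}")
```

Listing the per-query slowdowns alongside the totals makes it easier to see which individual queries regress even when the sum favours zstd.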
> >
> >
> > To change the default compressor, we need to:
> > 1. Append the compressor name to the carbondata file name, so that
> > the user can tell from the file name which compressor is used.
> > For example, the file name will change from
> > part-0-0_batchno0-0-0-1580982686749.carbondata
> > to part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
> > or part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
> >
> >
> > 2. Change the compressor constant in the CarbonCommonConstants.java
> > file to use zstd as the default compressor
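
The naming scheme in step 1 can be sketched as a pair of helpers. This is only an illustration of the proposed file name format, not the code from PR3606: the function names and the `KNOWN_COMPRESSORS` set are hypothetical, and only the two compressors mentioned in this thread are listed.

```python
CARBONDATA_SUFFIX = ".carbondata"
# Assumption: only the compressor names mentioned in this thread.
KNOWN_COMPRESSORS = {"snappy", "zstd"}


def with_compressor(file_name, compressor):
    """Insert the compressor name before the .carbondata suffix."""
    if not file_name.endswith(CARBONDATA_SUFFIX):
        raise ValueError(f"not a carbondata file: {file_name}")
    stem = file_name[: -len(CARBONDATA_SUFFIX)]
    return f"{stem}.{compressor}{CARBONDATA_SUFFIX}"


def compressor_from_name(file_name):
    """Return the compressor encoded in the file name, or None for an
    old-style name that carries no compressor segment."""
    if not file_name.endswith(CARBONDATA_SUFFIX):
        raise ValueError(f"not a carbondata file: {file_name}")
    stem = file_name[: -len(CARBONDATA_SUFFIX)]
    # The compressor, if present, is the last dot-separated segment.
    _, dot, tail = stem.rpartition(".")
    if dot and tail in KNOWN_COMPRESSORS:
        return tail
    return None


if __name__ == "__main__":
    old_name = "part-0-0_batchno0-0-0-1580982686749.carbondata"
    new_name = with_compressor(old_name, "zstd")
    print(new_name)
    print(compressor_from_name(new_name))
    print(compressor_from_name(old_name))
```

A `None` result for an old-style name is where a reader would fall back to the compressor name stored in the file header, which is the compatibility concern raised earlier in the thread.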
> >
> >
> > What do you think?
> >
> >
> > Regards,
> > Jacky

--
Thanks & Regards,
Ravi