Apache CarbonData Dev Mailing List archive

Discussion: change default compressor to ZSTD

Jacky Li

Discussion: change default compressor to ZSTD

Hi,

I compared snappy and zstd compressor using TPCH for carbondata.

For TPCH lineitem table:

             carbon-zstd  carbon-snappy
loading (s)  53           51
size         795MB        1.2GB

TPCH-query:

       carbon-zstd  carbon-snappy
Q1     4.289        8.29
Q2     12.609       12.986
Q3     14.902       14.458
Q4     6.276        5.954
Q5     23.147       21.946
Q6     1.12         0.945
Q7     23.017       28.007
Q8     14.554       15.077
Q9     28.472       27.473
Q10    24.067       24.682
Q11    3.321        3.79
Q12    5.311        5.185
Q13    14.08        11.84
Q14    2.262        2.087
Q15    5.496        4.772
Q16    29.919       29.833
Q17    7.018        7.057
Q18    17.367       17.795
Q19    2.931        2.865
Q20    11.347       10.937
Q21    26.416       28.414
Q22    5.923        6.311
sum    283.844      290.704

As you can see, with zstd the table size is about 33% smaller than with snappy, and the difference in data loading and query time is negligible. So I suggest changing the default compressor in CarbonData from snappy to zstd.
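
The percentages behind this summary can be rechecked from the numbers posted in this message. The sketch below is only an illustration: the per-query values are copied from the table, "1.2GB" is assumed to mean 1200 MB (the post gives no more precision), and `percent_change` is a helper name made up for this example.

```python
# Per-query times copied from the TPCH table in this message (Q1..Q22).
zstd_q = [4.289, 12.609, 14.902, 6.276, 23.147, 1.12, 23.017, 14.554,
          28.472, 24.067, 3.321, 5.311, 14.08, 2.262, 5.496, 29.919,
          7.018, 17.367, 2.931, 11.347, 26.416, 5.923]
snappy_q = [8.29, 12.986, 14.458, 5.954, 21.946, 0.945, 28.007, 15.077,
            27.473, 24.682, 3.79, 5.185, 11.84, 2.087, 4.772, 29.833,
            7.057, 17.795, 2.865, 10.937, 28.414, 6.311]

# Assumption: "1.2GB" is read as 1200 MB.
zstd_size_mb, snappy_size_mb = 795, 1200
zstd_load_s, snappy_load_s = 53, 51


def percent_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base` in percent (negative = smaller)."""
    return (new - base) / base * 100.0


zstd_total = sum(zstd_q)
snappy_total = sum(snappy_q)

size_change = percent_change(zstd_size_mb, snappy_size_mb)
load_change = percent_change(zstd_load_s, snappy_load_s)
query_change = percent_change(zstd_total, snappy_total)

print(f"query totals: zstd={zstd_total:.3f} snappy={snappy_total:.3f}")
print(f"size:  {size_change:+.2f}%")
print(f"load:  {load_change:+.2f}%")
print(f"query: {query_change:+.2f}%")
```

Under these assumptions the size reduction comes out at roughly a third, the load time is a few percent higher with zstd, and the summed query time is a few percent lower. Note that most of the query-time gain in this run comes from Q1 and Q7.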

To change the default compressor, we need to:
1. Append the compressor name to the carbondata file name, so that the user can tell from the file name which compressor was used.
For example, the file name will change from
 part-0-0_batchno0-0-0-1580982686749.carbondata
to
 part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
or
 part-0-0_batchno0-0-0-1580982686749.zstd.carbondata

2. Change the compressor constant in CarbonCommonConstants.java to use zstd as the default compressor.
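
To make the naming scheme in step 1 concrete, here is a minimal sketch of how such a name could be built and read back. It is not CarbonData code: `with_compressor`, `compressor_from_name`, and the constants are hypothetical names, and treating an old-style name as "compressor unknown" is just one possible way to handle compatibility.

```python
from typing import Optional

# Hypothetical constants for this sketch only.
DATA_FILE_EXT = ".carbondata"
KNOWN_COMPRESSORS = ("snappy", "zstd")


def with_compressor(file_name: str, compressor: str) -> str:
    """Insert the compressor name just before the .carbondata extension."""
    if not file_name.endswith(DATA_FILE_EXT):
        raise ValueError(f"not a carbondata file: {file_name}")
    stem = file_name[: -len(DATA_FILE_EXT)]
    return f"{stem}.{compressor}{DATA_FILE_EXT}"


def compressor_from_name(file_name: str) -> Optional[str]:
    """Return the compressor encoded in the file name.

    Returns None for old-style names without a compressor part, in which
    case the caller would have to read the compressor from the file metadata.
    """
    if not file_name.endswith(DATA_FILE_EXT):
        raise ValueError(f"not a carbondata file: {file_name}")
    stem = file_name[: -len(DATA_FILE_EXT)]
    _, dot, suffix = stem.rpartition(".")
    if dot and suffix in KNOWN_COMPRESSORS:
        return suffix
    return None


old_name = "part-0-0_batchno0-0-0-1580982686749.carbondata"
new_name = with_compressor(old_name, "zstd")
print(new_name)
print(compressor_from_name(new_name), compressor_from_name(old_name))
```

Keeping `.carbondata` as the final extension means anything that selects data files by extension keeps working, and only code that parses the rest of the name has to learn about the extra segment.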

What do you think?

Regards,
Jacky

Ajantha Bhat

Re: Discussion: change default compressor to ZSTD

Hi,

33% is a huge reduction in store size. If the difference in load and query
time is negligible, we should definitely go for it.

But does the user really need to know which compression is used? Changing
the file name may require handling compatibility.
The thrift *FileHeader, ChunkCompressionMeta* already store the compressor
name, so query-time decoding can be based on this.
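
As a rough illustration of decoding based on the stored name, the sketch below looks up a decompressor by the name carried in per-chunk metadata. Everything in it is an assumption made for the example: `ChunkMeta` is a stand-in for the thrift structure, and the standard-library codecs `zlib` and `bz2` stand in for snappy and zstd, which a real reader would register instead.

```python
import bz2
import zlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ChunkMeta:
    """Stand-in for chunk metadata that records the compressor name."""
    compressor_name: str


# Stand-in codecs; a real reader would register snappy and zstd here.
DECOMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.decompress,
    "bz2": bz2.decompress,
}


def decode_chunk(meta: ChunkMeta, payload: bytes) -> bytes:
    """Pick the decompressor named in the metadata, ignoring the file name."""
    try:
        decompress = DECOMPRESSORS[meta.compressor_name]
    except KeyError:
        raise ValueError(f"unsupported compressor: {meta.compressor_name}") from None
    return decompress(payload)


raw = b"lineitem" * 4
print(decode_chunk(ChunkMeta("zlib"), zlib.compress(raw)) == raw)
print(decode_chunk(ChunkMeta("bz2"), bz2.compress(raw)) == raw)
```

With this kind of lookup the reader never needs the file name to decode, which is why a name change would matter for compatibility only where the name itself is parsed.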

Thanks,
Ajantha

On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <[hidden email]> wrote:

> [...]

Jacky Li

Re: Discussion: change default compressor to ZSTD

Hi Ajantha,

Yes, the decoder will use the compressorName stored in ChunkCompressionMeta from the file header,
but I think it is better to put it in the file name, so that the user can see the compressor from the shell without launching an engine to read it.

In Spark, for Parquet/ORC the written file name looks like: part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc

In PR3606, I will handle the compatibility.

Regards,
Jacky

------------------ Original Message ------------------
From: "Ajantha Bhat"<[hidden email]>;
Sent: Thursday, February 6, 2020, 11:51 PM
To: "dev"<[hidden email]>;

Subject: Re: Discussion: change default compressor to ZSTD

[...]

ravipesala

Re: Discussion: change default compressor to ZSTD

Hi Jacky,

As per the original PR
https://github.com/apache/carbondata/pull/2628 , query performance
decreased by 20% ~ 50% compared to snappy, so I am concerned about the
performance. Please run a proper TPCH performance report on the
regular cluster, as we do for every version, and decide based on that.

Regards,
Ravindra.

On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <[hidden email]> wrote:

> [...]

--
Thanks & Regards,
Ravi

Ajantha Bhat

Re: Discussion: change default compressor to ZSTD

Hi Jacky and Ravindra,

We have tested ZSTD vs snappy again with the latest code on a 3-node Spark
2.3 cluster on HDFS, with 500 GB of TPCH data.
Below is the summary:

*1. The ZSTD store is 28.8% smaller than the snappy store*
*2. Overall query time is degraded by 18.35% with ZSTD compared to snappy*
*3. Load time with ZSTD shows a negligible degradation of 0.7% compared to
snappy*

Based on this, I guess we cannot use ZSTD as the default, due to the large
degradation in query time.

Thanks,
Ajantha

On Fri, Feb 7, 2020 at 4:54 PM Ravindra Pesala <[hidden email]>
wrote:

> [...]

Jacky Li

Re: Discussion: change default compressor to ZSTD

OK, thanks for the test.
Then in PR3606 I will only add the compressor name to the file name, without changing the default compressor to ZSTD.

Regards,
Jacky

> On Feb 20, 2020, at 12:52 PM, Ajantha Bhat <[hidden email]> wrote:
> [...]