Hi,
I compared the snappy and zstd compressors using TPCH for carbondata.

For the TPCH lineitem table:

              carbon-zstd   carbon-snappy
loading (s)   53            51
size          795MB         1.2GB

TPCH-query:

       carbon-zstd   carbon-snappy
Q1     4.289         8.29
Q2     12.609        12.986
Q3     14.902        14.458
Q4     6.276         5.954
Q5     23.147        21.946
Q6     1.12          0.945
Q7     23.017        28.007
Q8     14.554        15.077
Q9     28.472        27.473
Q10    24.067        24.682
Q11    3.321         3.79
Q12    5.311         5.185
Q13    14.08         11.84
Q14    2.262         2.087
Q15    5.496         4.772
Q16    29.919        29.833
Q17    7.018         7.057
Q18    17.367        17.795
Q19    2.931         2.865
Q20    11.347        10.937
Q21    26.416        28.414
Q22    5.923         6.311
sum    283.844       290.704

As you can see, with zstd the table size is 33% smaller than with snappy, and the difference in data loading and query time is negligible. So I suggest changing the default compressor in carbondata from snappy to zstd.

To change the default compressor, we need to:
1. Append the compressor name to the carbondata file name, so that the user can tell from the file name which compressor was used.
   For example, the file name will change from
   part-0-0_batchno0-0-0-1580982686749.carbondata
   to part-0-0_batchno0-0-0-1580982686749.snappy.carbondata
   or part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
2. Change the compressor constant in CarbonCommonConstants.java to use zstd as the default compressor.

What do you think?

Regards,
Jacky
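The percentages quoted in the message above can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the table, the helper name `relative_change` is made up for this example, and it assumes that 1.2GB means 1200MB (with 1024MB per GB the size reduction comes out slightly larger).

```python
# Figures copied from the benchmark table in the message above.
# Assumption: 1.2GB is treated as 1200MB.
SIZE_MB = {"zstd": 795.0, "snappy": 1200.0}
LOAD_S = {"zstd": 53.0, "snappy": 51.0}
QUERY_SUM = {"zstd": 283.844, "snappy": 290.704}


def relative_change(new: float, baseline: float) -> float:
    """Return the change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Negative values mean zstd is smaller/faster than snappy.
size_change = relative_change(SIZE_MB["zstd"], SIZE_MB["snappy"])
load_change = relative_change(LOAD_S["zstd"], LOAD_S["snappy"])
query_change = relative_change(QUERY_SUM["zstd"], QUERY_SUM["snappy"])

print(f"size:  {size_change:+.2f}%")
print(f"load:  {load_change:+.2f}%")
print(f"query: {query_change:+.2f}%")
```

Under that assumption the size change is close to the 33% quoted, while the load and summed query times differ by only a few percent on this single-table test.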
Hi,
A 33% reduction in store size is huge. If the difference in load and query time is negligible, we should definitely go for it.

But does the user really need to know which compressor is used? Changing the file name may require compatibility handling. The thrift *FileHeader, ChunkCompressionMeta* already stores the compressor name, so query-time decoding can be based on that.

Thanks,
Ajantha

On Thu, Feb 6, 2020 at 4:27 PM Jacky Li <[hidden email]> wrote:
> [...]
Hi Ajantha,
Yes, the decoder will use the compressorName stored in ChunkCompressionMeta from the file header, but I think it is better to also put it in the file name, so that the user can see the compressor from the shell without launching an engine to read the file.

In Spark, the file name written for parquet/orc looks like:
part-00115-e2758995-4b10-4bd2-bf15-b4c176e587fe-c000.snappy.orc

I will handle the compatibility in PR3606.

Regards,
Jacky

------------------ Original Message ------------------
From: "Ajantha Bhat" <[hidden email]>
Sent: Thursday, February 6, 2020, 11:51 PM
To: "dev" <[hidden email]>
Subject: Re: Discussion: change default compressor to ZSTD

[...]
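To make the naming proposal concrete, here is a minimal sketch of how a tool could read the compressor back from a file name of the proposed form. It is not the PR3606 implementation: the function name `compressor_from_file_name` and the `KNOWN_COMPRESSORS` set are assumptions made for this example, and only the two compressor names mentioned in this thread are recognized.

```python
from typing import Optional

# Assumption: only the compressor names discussed in this thread.
KNOWN_COMPRESSORS = {"snappy", "zstd"}


def compressor_from_file_name(file_name: str) -> Optional[str]:
    """Return the compressor embedded in a name like
    <base>.<compressor>.carbondata, or None if there is none."""
    parts = file_name.split(".")
    if (
        len(parts) >= 3
        and parts[-1] == "carbondata"
        and parts[-2] in KNOWN_COMPRESSORS
    ):
        return parts[-2]
    # Old-style names carry no compressor; the caller would have to
    # fall back to the compressor name stored in the file metadata.
    return None


new_name = "part-0-0_batchno0-0-0-1580982686749.zstd.carbondata"
old_name = "part-0-0_batchno0-0-0-1580982686749.carbondata"
print(compressor_from_file_name(new_name))
print(compressor_from_file_name(old_name))
```

Returning `None` for old-style names is one way to express the compatibility concern raised earlier: files written before the change would still rely on the compressor name recorded in the file header.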
Hi Jacky,
According to the original PR https://github.com/apache/carbondata/pull/2628, query performance decreased by 20% ~ 50% compared to snappy, so I am concerned about the performance. Please prepare a proper TPCH performance report on the regular cluster, as we do for every version, and let us decide based on that.

Regards,
Ravindra.

On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li <[hidden email]> wrote:
> [...]

--
Thanks & Regards,
Ravi
Hi Jacky and Ravindra,
We tested ZSTD vs snappy again with the latest code on a 3-node Spark 2.3 cluster on HDFS with 500 GB of TPCH data. Below is the summary:

*1. The ZSTD store is 28.8% smaller than the snappy store*
*2. Overall query time is degraded by 18.35% with ZSTD compared to snappy*
*3. Load time with ZSTD shows a negligible degradation of 0.7% compared to snappy*

Based on this, I guess we cannot use ZSTD as the default, due to the large degradation in query time.

Thanks,
Ajantha

On Fri, Feb 7, 2020 at 4:54 PM Ravindra Pesala <[hidden email]> wrote:
> [...]
Ok, thanks for the test.
Then in PR3606 I will only add the compressor name to the file name, and will not change the default compressor to ZSTD.

Regards,
Jacky

> On Feb 20, 2020, at 12:52 PM, Ajantha Bhat <[hidden email]> wrote:
>
> [...]