Hi all
I want to make a proposal to support QATCodec [1] in CarbonData. The QAT Codec project provides a compression and decompression library for Apache Hadoop/Spark that makes use of Intel(R) QuickAssist Technology (abbrev. QAT) [2] for compression/decompression. The project was open sourced this year, along with its underlying native dependency, QATZip, which users can install with a Linux package-management utility (e.g. Yum on CentOS). The project has two major benefits:
1) Wide ecosystem support
It supports Hadoop & Spark directly by implementing the Hadoop & Spark de/compression APIs, and it also provides patches to integrate with Parquet and ORC-Hive.
2) High performance and space efficiency
We measured the performance and compression ratio of QATCodec in different workloads against Snappy. For the sort workload with MapReduce (input, intermediate data, and output all compression-enabled; 3TB data scale; 5 workers; 2 replicas for data), QATCodec brings a 7.29% performance gain and a 7.5% better compression ratio. For the sort workload with Spark (input and intermediate data compression-enabled; 3TB data scale), it brings a 14.3% performance gain and a 7.5% better compression ratio. We also measured Hive on MR with the TPCx-BB workload [3] (3TB data scale): it brings a 12.98% performance gain and a 13.65% better compression ratio.
Regarding the hardware requirement, the current implementation falls back to a software implementation in the absence of a QAT device.
CarbonData currently supports two compression codecs: Zstd and Snappy. I think an extra compression option with hardware acceleration will benefit users.

Please feel free to share your comments on this proposal.

[1] https://github.com/intel-hadoop/IntelQATCodec
[2] https://01.org/zh/intel-quickassist-technology
[3] http://www.tpc.org/tpcx-bb/default.asp

Best Regards
Ferdinand Xu
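Since QATCodec implements the standard Hadoop and Spark codec interfaces, enabling it in a cluster is essentially a configuration change. Below is a minimal sketch of what that wiring might look like; the `com.intel.qat.*` codec class names are assumptions for illustration and should be checked against the IntelQATCodec repo:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: the codec class names below are assumptions,
// not taken from the IntelQATCodec repo. Spark accepts a fully
// qualified CompressionCodec class name for io compression, and
// Hadoop output compression is configured per job.
val spark = SparkSession.builder()
  .appName("qat-codec-demo")
  // Spark shuffle/RDD compression codec.
  .config("spark.io.compression.codec", "com.intel.qat.spark.QatCodec")
  // Hadoop output compression for jobs writing through the Hadoop API.
  .config("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
  .config("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
    "com.intel.qat.hadoop.QatCodec")
  .getOrCreate()
```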
emm, if it only needs extending with another compressor for the software implementation, I think it will be quite easy to integrate. Actually, a PR was already raised weeks ago to support customized compressors in CarbonData; you can refer to this link: https://github.com/apache/carbondata/pull/2715. See the `CustomizeCompressor` in `TestLoadDataWithCompression.scala` for more information.
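To make the shape of such an extension concrete, here is a minimal standalone sketch of the byte-array compress/decompress pair that a CarbonData compressor centers on. A real plug-in would implement `org.apache.carbondata.core.datastore.compression.Compressor` from PR #2715, which carries additional typed methods beyond the two shown; `java.util.zip` stands in here for the QATZip native calls:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.{Deflater, Inflater}

// Illustrative stand-in only: a real CarbonData compressor implements
// the full Compressor interface (see PR #2715); java.util.zip is used
// here in place of the QATZip native library.
object QatLikeCompressor {
  // The name users would configure to select this compressor.
  val name: String = "qatlike"

  def compressByte(input: Array[Byte]): Array[Byte] = {
    val deflater = new Deflater()
    deflater.setInput(input)
    deflater.finish()
    val out = new ByteArrayOutputStream(math.max(input.length, 64))
    val buf = new Array[Byte](4096)
    while (!deflater.finished()) {
      val n = deflater.deflate(buf)
      out.write(buf, 0, n)
    }
    deflater.end()
    out.toByteArray
  }

  def unCompressByte(input: Array[Byte]): Array[Byte] = {
    val inflater = new Inflater()
    inflater.setInput(input)
    val out = new ByteArrayOutputStream(math.max(input.length * 2, 64))
    val buf = new Array[Byte](4096)
    while (!inflater.finished()) {
      val n = inflater.inflate(buf)
      out.write(buf, 0, n)
    }
    inflater.end()
    out.toByteArray
  }
}
```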
Thanks Chuanyin. This PR looks cool. Allowing a customized codec is a good option. Compared with the existing built-in Snappy codec in CarbonData, I think QATCodec, with better performance and a better compression ratio, is also a good candidate for built-in support. Any thoughts?
Thanks
Ferdinand Xu
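If QAT did become a built-in option, selecting it would presumably follow the same path as Snappy and Zstd today. A hypothetical sketch: the `carbon.column.compressor` key follows CarbonData's existing property convention, and the `qat` value is an assumption:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Hypothetical: "qat" as a built-in compressor name is an assumption;
// the currently supported values are "snappy" and "zstd".
CarbonProperties.getInstance()
  .addProperty("carbon.column.compressor", "qat")
```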
+1
This would further enhance the performance of queries where IO is the bottleneck.

Regards
Raghu
Thanks Raghu. Will prepare the pull request.
Thanks
Ferdinand Xu
Hi,
Good to know about QATCodec. I have a quick question: is QATCodec an independent compression/decompression library, or does it depend on particular hardware to achieve the performance improvements you mentioned? Is there a link to the QATCodec project or source code?

Regards,
Jacky
Hi Jacky
The repo is at https://github.com/intel-hadoop/IntelQATCodec, open sourced under the Apache license. Regarding the hardware dependency and performance: it needs the extra QAT device [1] for hardware acceleration, and it falls back to a software-based GZip implementation otherwise. Its performance has been certified by Cloudera [2].

[1] https://www.intel.cn/content/www/cn/zh/architecture-and-technology/intel-quick-assist-technology-overview.html
[2] https://www.cloudera.com/partners/partners-listing.html?q=intel

Thanks
Ferdinand Xu
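The fall-back behavior described above follows a common pattern: probe for the native library at load time and route to a software codec when it is absent. A hypothetical sketch of that pattern (not the actual QATCodec code; the `qatzip` library name and the JNI entry point are assumptions):

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Hypothetical sketch of the hardware/software fall-back pattern the
// proposal describes; "qatzip" and the native call are assumptions.
object FallbackCompressor {
  // Probe once at class load: true only if the native QAT library loads.
  private val qatAvailable: Boolean =
    try { System.loadLibrary("qatzip"); true }
    catch { case _: UnsatisfiedLinkError => false }

  def compress(input: Array[Byte]): Array[Byte] =
    if (qatAvailable) compressWithQat(input) else compressWithGzip(input)

  private def compressWithQat(input: Array[Byte]): Array[Byte] = {
    // A real implementation would call the QATZip JNI bindings here;
    // the GZip call below is a placeholder so the sketch runs anywhere.
    compressWithGzip(input)
  }

  private def compressWithGzip(input: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(input)
    gz.close()
    bos.toByteArray
  }
}
```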
Thanks for proposing this QATCodec.
Are any performance benchmarks already available against Snappy or ZSTD?
Yes, we have performance numbers against Snappy; they are included in our proposal. The performance varies depending on the workload.
> For the sort workload (input, intermediate data, output are all compression-enabled, 3TB data scale, 5 workers, 2 replica for data) with Map Reduce, using QATCodec brings 7.29% performance gain and 7.5% better compression ratio. For the sort workload (input and intermediate data are compression-enabled, 3TB data scale) with Spark, it brings 14.3% performance gain, 7.5% better compression ratio. Also we measured in Hive on MR with TPCx-BB workload [3] (3TB data scale), it brings 12.98% performance gain, 13.65% better compression ratio.

Thanks
Ferdinand Xu