Proposal to integrate QATCodec into Carbondata


Xu, Cheng A
Hi all
I would like to propose integrating QATCodec [1] into CarbonData. The QATCodec project provides a compression and decompression library that lets Apache Hadoop/Spark make use of Intel(R) QuickAssist Technology (QAT) [2]. The project was open-sourced this year, together with its underlying native dependency, QATzip, which users can install through a Linux package-management utility (e.g. yum on CentOS). The project has two major benefits:
1) Wide ecosystem support
It supports Hadoop and Spark directly by implementing their compression/decompression APIs, and it provides patches to integrate with Parquet and Hive ORC (a wiring sketch follows at the end of this message).
2) High performance and space efficiency
We measured the performance and compression ratio of QATCodec against Snappy on several workloads:
- MapReduce sort (input, intermediate data, and output all compressed; 3 TB scale; 5 workers; 2 data replicas): 7.29% performance gain and 7.5% better compression ratio.
- Spark sort (input and intermediate data compressed; 3 TB scale): 14.3% performance gain and 7.5% better compression ratio.
- Hive on MapReduce with the TPCx-BB workload [3] (3 TB scale): 12.98% performance gain and 13.65% better compression ratio.
Regarding the hardware requirement, the current implementation falls back to a software implementation in the absence of a QAT device.
CarbonData currently supports two compression codecs, Snappy and Zstd. I think an extra compression option with hardware acceleration will benefit users.
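To make the ecosystem point concrete, here is a minimal sketch of how a Hadoop-style codec such as QATCodec is typically wired into MapReduce and Spark jobs. The configuration keys are standard Hadoop/Spark ones; the codec class names are hypothetical placeholders, not the actual classes shipped by the QATCodec project:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Hadoop side: enable compression of intermediate map output and point the
// job at a codec implementing Hadoop's CompressionCodec API.
val hadoopConf = new Configuration()
hadoopConf.setBoolean("mapreduce.map.output.compress", true)
hadoopConf.set("mapreduce.map.output.compress.codec",
  "com.example.qat.QatCodec") // placeholder fully qualified class name

// Spark side: spark.io.compression.codec accepts the fully qualified class
// name of an org.apache.spark.io.CompressionCodec implementation.
val sparkConf = new SparkConf()
  .set("spark.io.compression.codec", "com.example.qat.QatSparkCodec") // placeholder
```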

Please feel free to share your comments on this proposal.


[1] https://github.com/intel-hadoop/IntelQATCodec
[2] https://01.org/zh/intel-quickassist-technology
[3] http://www.tpc.org/tpcx-bb/default.asp

Best Regards
Ferdinand Xu


Re: Proposal to integrate QATCodec into Carbondata

xuchuanyin
Hmm, if this only requires extending CarbonData with another compressor on the software side, I think it will be quite easy to integrate.

Actually, a PR to support customized compressors in CarbonData was raised a few weeks ago: https://github.com/apache/carbondata/pull/2715. See `CustomizeCompressor` in `TestLoadDataWithCompression.scala` for more information.
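For anyone trying that route, a minimal sketch of the registration side, assuming the `carbon.column.compressor` property the PR uses; the compressor class name here is hypothetical, and the class would need to implement CarbonData's `Compressor` interface, as `CustomizeCompressor` does in the test:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Point CarbonData at a custom Compressor implementation by its fully
// qualified class name ("org.example.compress.QatCompressor" is a
// hypothetical placeholder, not a real class).
CarbonProperties.getInstance()
  .addProperty("carbon.column.compressor", "org.example.compress.QatCompressor")
```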



RE: Proposal to integrate QATCodec into Carbondata

Xu, Cheng A
Thanks Chuanyin, this PR looks cool. Allowing a customized codec is a good option. Compared with the existing built-in Snappy codec in CarbonData, I think QATCodec, with its better performance and compression ratio, is also a good candidate for built-in support. Any thoughts?
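If it were built in, selecting it could presumably be as simple as choosing a short codec name, the way Snappy and Zstd are selected today. A hypothetical example (the name "qat" is illustrative only):

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Hypothetical built-in usage: pick the codec by short name, as with
// "snappy" and "zstd" ("qat" does not exist yet).
CarbonProperties.getInstance().addProperty("carbon.column.compressor", "qat")
```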

Thanks
Ferdinand Xu

Re: Proposal to integrate QATCodec into Carbondata

sraghunandan
+1
This would further enhance the performance of queries where I/O is the bottleneck.

Regards
Raghu

RE: Proposal to integrate QATCodec into Carbondata

Xu, Cheng A
Thanks Raghu. We will prepare the pull request.

Thanks
Ferdinand Xu


Re: Proposal to integrate QATCodec into Carbondata

Jacky Li
Hi,

Good to know about QATCodec. I have a quick question: is QATCodec an independent compression/decompression library, or does it depend on particular hardware to achieve the performance improvement you mentioned?

Is there a link to the QATCodec project or its source code?

Regards,
Jacky

RE: Proposal to integrate QATCodec into Carbondata

Xu, Cheng A
Hi Jacky,
The repository is at https://github.com/intel-hadoop/IntelQATCodec, open-sourced under the Apache license. Regarding the hardware dependency and performance: it needs an extra QAT device [1] for hardware acceleration and falls back to a software-based Gzip implementation otherwise. Its performance has been certified by Cloudera [2].
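Purely as an illustration of that fallback shape (this is not the project's actual code), it looks roughly like the following, with `java.util.zip` standing in for the software Gzip path and the device probe left as a placeholder:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

object QatFallbackSketch {
  // Placeholder probe: the real library detects the QAT device itself.
  private def qatDeviceAvailable: Boolean = false

  // Software fallback: plain gzip via java.util.zip, mirroring the
  // gzip-compatible output described above.
  private def gzipCompress(input: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    try gz.write(input) finally gz.close()
    bos.toByteArray()
  }

  def compress(input: Array[Byte]): Array[Byte] =
    if (qatDeviceAvailable) sys.error("QAT hardware path elided in this sketch")
    else gzipCompress(input) // software fallback
}
```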

[1] https://www.intel.cn/content/www/cn/zh/architecture-and-technology/intel-quick-assist-technology-overview.html 
[2] https://www.cloudera.com/partners/partners-listing.html?q=intel 

Thanks
Ferdinand Xu


Re: Proposal to integrate QATCodec into Carbondata

brijoobopanna
Thanks for proposing QATCodec.

Are any performance benchmarks against Snappy or ZSTD already available?



RE: Proposal to integrate QATCodec into Carbondata

Xu, Cheng A
Yes, we have performance numbers against Snappy; they are included in the proposal. The gains vary depending on the workload:

> - MapReduce sort (input, intermediate data, and output all compressed; 3 TB scale; 5 workers; 2 data replicas): 7.29% performance gain and 7.5% better compression ratio.
> - Spark sort (input and intermediate data compressed; 3 TB scale): 14.3% performance gain and 7.5% better compression ratio.
> - Hive on MapReduce with the TPCx-BB workload [3] (3 TB scale): 12.98% performance gain and 13.65% better compression ratio.

Thanks
Ferdinand Xu
