Apache CarbonData Dev Mailing List archive

Enhancement on compaction performance

xuchuanyin

Enhancement on compaction performance

Hi all:
I am raising a PR to enhance the performance of compaction. The PR number is #2906.

Based on my experiments using about 72GB of LineItem data (from the 100GB TPCH data set), I got the following results.

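Code Branch | Prefetch | Batch Size (default 100) | Load1 (s) | Load2 (s) | Load3 (s) | Compact 3 Loads (s) | Time Reduced
------------+----------+--------------------------+-----------+-----------+-----------+---------------------+-------------
master      | NA       | 100                      | 447.4     | 445.9     | 450.1     | 661.3               | Baseline
master      | NA       | 32000                    | 441.5     | 454.4     | 456.8     | 641.2               | +3.0%
PR2906      | enable   | 100                      | 445.3     | 450.2     | 445.3     | 411.8               | +37.7%
PR2906      | enable   | 32000                    | 438.7     | 446.8     | 441.8     | 333.1               | +49.6%
PR2906      | disable  | 100                      | 458.1     | 459.4     | 450.9     | 659.5               | +0.3%
PR2906      | disable  | 32000                    | 472.0     | 446.8     | 457.1     | 654.5               | +1.0%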
Note: These tests were run on Spark 2.2.

The results show that compaction performance is almost doubled when the feature is configured properly (prefetch enabled with a batch size of 32000).
They also show that compaction performance does not degrade when this feature is disabled.
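
For anyone who wants to re-check the "Time Reduced" column, here is a minimal sketch in Python. It only re-does the arithmetic on the compaction times reported above; the function names and the dictionary layout are made up for illustration and are not part of the PR.

```python
# Compaction times in seconds, copied from the "Compact 3 Loads (s)" column.
BASELINE = 661.3  # master, batch size 100

compact_seconds = {
    ("master", "NA", 32000): 641.2,
    ("PR2906", "enable", 100): 411.8,
    ("PR2906", "enable", 32000): 333.1,
    ("PR2906", "disable", 100): 659.5,
    ("PR2906", "disable", 32000): 654.5,
}


def time_reduced(seconds, baseline=BASELINE):
    """Fraction of the baseline compaction time that was saved."""
    return (baseline - seconds) / baseline


def speedup(seconds, baseline=BASELINE):
    """How many times faster than the baseline the run was."""
    return baseline / seconds


for config, seconds in compact_seconds.items():
    print(config, f"{time_reduced(seconds):+.1%}", f"{speedup(seconds):.2f}x")
```

The "almost doubled" statement corresponds to the speedup of the run with prefetch enabled and batch size 32000, which comes out just under 2x.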

So here:

1. I would like to make this feature enabled by default.

2. I would also like others in the community to test this feature and check whether their workloads benefit from it.

Any feedback is welcome.

xuchuanyin

Re: Enhancement on compaction performance

Hi, all:

The previous experiment used 3 Huawei ECS instances as workers, each with 16
cores and 32GB of memory. The Spark executor used 12 cores and 24GB. The data
was the 74GB LineItem table from 100GB TPCH.

Today I ran another experiment on 1 Huawei RH2288 machine with 32 cores
and 128GB of memory. The Spark executor used 30 cores and 90GB. The data was
the 7.3GB LineItem table from 10GB TPCH. The results are below:

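Code Branch | Prefetch | Batch Size (default 100) | Load1 (s) | Load2 (s) | Load3 (s) | Compact 3 Loads (s) | Time Reduced | Perf Enhanced
------------+----------+--------------------------+-----------+-----------+-----------+---------------------+--------------+--------------
master      | NA       | 100                      | 147.4     | 142.3     | 144.6     | 201.4               | Baseline     | Baseline
master      | NA       | 32000                    | 140.8     | 138.7     | 141.6     | 196.2               | 2.6%         | 2.7%
PR2906      | enable   | 100                      | 143.9     | 142.5     | 146.2     | 99.9                | 50.4%        | 101.6%
PR2906      | enable   | 32000                    | 142.1     | 139.3     | 136.9     | 98.3                | 51.2%        | 104.9%
PR2906      | disable  | 100                      | 146.7     | 137.4     | 139.6     | 200.6               | 0.4%         | 0.4%
PR2906      | disable  | 32000                    | 145.2     | 145.0     | 139.7     | 195.7               | 2.8%         | 2.9%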

These results also show that this PR does not degrade compaction performance
when prefetch is disabled and improves it when prefetch is enabled.
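
The "Time Reduced" and "Perf Enhanced" columns describe the same measurement in two ways. The sketch below assumes "Perf Enhanced" means the gain in throughput relative to the baseline, that is baseline / time - 1; this reading is an inference that matches the reported numbers, not something stated in the PR. The function names are hypothetical.

```python
def time_reduced(baseline, seconds):
    """Fraction of the baseline time saved: r = 1 - t / b."""
    return 1.0 - seconds / baseline


def perf_enhanced(baseline, seconds):
    """Assumed meaning of "Perf Enhanced": throughput gain, p = b / t - 1."""
    return baseline / seconds - 1.0


def perf_from_time_reduced(r):
    """The two metrics are linked by p = r / (1 - r)."""
    return r / (1.0 - r)


baseline = 201.4  # master, batch size 100, from the table above
for seconds in (196.2, 99.9, 98.3, 200.6, 195.7):
    r = time_reduced(baseline, seconds)
    p = perf_enhanced(baseline, seconds)
    print(f"{seconds:6.1f}s  time reduced {r:.1%}  perf enhanced {p:.1%}")
```

Because p = r / (1 - r), a time reduction of about 50% corresponds to roughly a 100% gain in throughput, which is why the two columns diverge so sharply for the runs with prefetch enabled.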

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Jacky Li

Re: Enhancement on compaction performance

Hi Xuchuanyin,

This feature is great for compaction. Did you observe higher memory usage, since it prefetches data into memory? Do you have any numbers?

Regards,
Jacky

> On Nov 7, 2018, at 11:54 PM, xuchuanyin <[hidden email]> wrote:
>
> Hi all:
> I am raising a PR to enhance the performance of compaction. The PR number is #2906.
> [...]

xuchuanyin

Re: Enhancement on compaction performance

Oh, I didn't measure the memory consumption at that time.

We all know that resource utilization is low during compaction.
Using prefetch means that we run the query in the background, so it will
surely consume more resources.
Currently the prefetch size is controlled by 'carbon.detail.batch.size',
which defaults to 100. This means an extra 100 rows are kept in memory
before they are retrieved.
So the memory overhead consists of the memory consumed by the query plus the
memory for 'carbon.detail.batch.size' records.
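
To illustrate the idea only, here is a minimal Python sketch of prefetching with a bounded buffer. It is not the code in the PR (CarbonData itself is written in Java and Scala); `read_batches`, `prefetch` and the `depth` parameter are hypothetical names, and the batch size of 100 simply mirrors the default of 'carbon.detail.batch.size'.

```python
import queue
import threading

_DONE = object()  # sentinel: the background reader has finished


def read_batches(rows, batch_size=100):
    """Hypothetical stand-in for the query side of compaction."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def prefetch(batches, depth=1):
    """Yield batches while a background thread reads ahead of the consumer."""
    buffer = queue.Queue(maxsize=depth)

    def reader():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
            buffer.put(_DONE)
        except Exception as error:  # hand failures over to the consumer
            buffer.put(error)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


rows = list(range(250))
merged = [row for batch in prefetch(read_batches(rows)) for row in batch]
```

In this sketch the extra memory is bounded by the buffer: at most `depth` batches wait in the queue, plus one more that the reader may be holding while blocked. Raising the batch size from 100 to 32000 makes each buffered batch 320 times larger in row count, so the overhead grows accordingly.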
