Hi all:
I am raising a PR to improve the performance of compaction. The PR number is #2906.

Based on my experiments using about 72GB of LineItem data (from 100GB TPCH data), I got the following results:

Code Branch  Prefetch  Batch Size (default 100)  Load1 (s)  Load2 (s)  Load3 (s)  Compact 3 Loads (s)  Time Reduced
master       NA        100                       447.4      445.9      450.1      661.3                Baseline
master       NA        32000                     441.5      454.4      456.8      641.2                +3.0%
PR2906       enable    100                       445.3      450.2      445.3      411.8                +37.7%
PR2906       enable    32000                     438.7      446.8      441.8      333.1                +49.6%
PR2906       disable   100                       458.1      459.4      450.9      659.5                +0.3%
PR2906       disable   32000                     472.0      446.8      457.1      654.5                +1.0%

Note: These tests were run on Spark 2.2.

The results show that compaction performance almost doubles when the feature is configured properly (prefetch enabled with a batch size of 32000). They also show that compaction performance does not decrease when the feature is disabled.

So here:

1. I do want to make this feature 'enabled' by default.

2. Besides, I'd like others in the community to test this feature as well and check whether they benefit from it.

Any feedback is welcome.
Hi, all:
The previous experiment used 3 Huawei ECS instances as workers, each with 16 cores and 32GB. The Spark executor used 12 cores and 24GB. It used 74GB of LineItem data from 100GB TPCH.

Today I ran another experiment using 1 Huawei RH2288 machine with 32 cores and 128GB. The Spark executor used 30 cores and 90GB. It used 7.3GB of LineItem data from 10GB TPCH. The results are as below:

Code Branch  Prefetch  Batch Size (default 100)  Load1 (s)  Load2 (s)  Load3 (s)  Compact 3 Loads (s)  Time Reduced  Perf Enhanced
master       NA        100                       147.4      142.3      144.6      201.4                Baseline      Baseline
master       NA        32000                     140.8      138.7      141.6      196.2                2.6%          2.7%
PR2906       enable    100                       143.9      142.5      146.2      99.9                 50.4%         101.6%
PR2906       enable    32000                     142.1      139.3      136.9      98.3                 51.2%         104.9%
PR2906       disable   100                       146.7      137.4      139.6      200.6                0.4%          0.4%
PR2906       disable   32000                     145.2      145.0      139.7      195.7                2.8%          2.9%

These results also show that this PR does not decrease compaction performance when prefetch is disabled and improves it when prefetch is enabled.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
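For anyone who wants to re-derive the last two columns from their own runs, here is a minimal Python sketch. It assumes "Time Reduced" is the relative drop in compaction time against the baseline and "Perf Enhanced" is the relative gain in throughput (baseline time divided by measured time, minus one); the function names are made up for illustration and are not part of CarbonData.

```python
def time_reduced(baseline_s, measured_s):
    """Percentage of compaction time saved relative to the baseline."""
    return (baseline_s - measured_s) / baseline_s * 100.0


def perf_enhanced(baseline_s, measured_s):
    """Percentage speed-up relative to the baseline (same work, less time)."""
    return (baseline_s / measured_s - 1.0) * 100.0


# Compaction times (s) taken from the table above; baseline is master, batch size 100.
baseline = 201.4
runs = {
    "PR2906 enable, batch 100": 99.9,
    "PR2906 enable, batch 32000": 98.3,
}

for name, seconds in runs.items():
    print(f"{name}: time reduced {time_reduced(baseline, seconds):.1f}%, "
          f"perf enhanced {perf_enhanced(baseline, seconds):.1f}%")
```

The two metrics diverge quickly: halving the compaction time is a 50% time reduction but a 100% performance gain, which is why the enabled rows show roughly 50% and 100% side by side.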
In reply to this post by xuchuanyin
Hi Xuchuanyin,
This feature is great for compaction. I wonder whether you observed more memory being used, since it prefetches data into memory? Do you have any numbers?

Regards,
Jacky

> On Nov 7, 2018, at 11:54 PM, xuchuanyin <[hidden email]> wrote:
>
> Hi all:
> I am raising a PR to enhance the performance of compaction. The PR number is #2906.
>
> Based on my experiments using about 72GB LineItem data ( in 100GB TPCH data), I got the following results.
>
> Code Branch  Prefetch  Batch Size (default 100)  Load1 (s)  Load2 (s)  Load3 (s)  Compact 3 Loads (s)  Time Reduced
> master       NA        100                       447.4      445.9      450.1      661.3                Base Line
> master       NA        32000                     441.5      454.4      456.8      641.2                +3.0%
> PR2906       enable    100                       445.3      450.2      445.3      411.8                +37.7%
> PR2906       enable    32000                     438.7      446.8      441.8      333.1                +49.6%
> PR2906       disable   100                       458.1      459.4      450.9      659.5                +0.3%
> PR2906       disable   32000                     472.0      446.8      457.1      654.5                +1.0%
> Note: These tests are under spark-2.2 version
>
> The results show that compaction performance is almost doubled if configured properly.
> It also shows even if this feature is disabled, the compaction performance still not decrease.
>
> So here:
>
> 1. I do want to make this feature ‘enabled’ by default.
>
> 2. Besides, I’d want the others in the community also test this feature and check whether we can benefit from this feature.
>
> Any feedback is welcome.
Oh, I didn't pay attention to the memory consumption at that time.

We all know that resource utilization is low during compaction. Using prefetch means that we run the query in the background, so it will surely consume more resources. The prefetch size is currently controlled by 'carbon.detail.batch.size', which defaults to 100, meaning that an extra 100 rows are kept in memory before they are retrieved. So the memory overhead consists of the memory consumed by the query plus the memory for 'carbon.detail.batch.size' records.
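As a rough illustration of the estimate above, the extra memory held by the prefetched batch can be sketched as batch size times average row size. The row size below is a hypothetical placeholder, not a number measured in these experiments, and the function name is made up for illustration.

```python
def prefetch_overhead_bytes(batch_size, avg_row_bytes):
    """Estimate the extra memory held by one prefetched batch of rows."""
    return batch_size * avg_row_bytes


# ASSUMPTION: average in-memory size of one row; replace with a measured value.
ASSUMED_ROW_BYTES = 200

# The two values of carbon.detail.batch.size used in the experiments.
for batch_size in (100, 32000):
    extra = prefetch_overhead_bytes(batch_size, ASSUMED_ROW_BYTES)
    print(f"carbon.detail.batch.size={batch_size}: "
          f"about {extra / 1024:.1f} KiB extra per prefetched batch")
```

This only covers the prefetched rows themselves; the memory consumed by the background query comes on top of it and would have to be measured.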