Apache CarbonData Dev Mailing List archive

Re: Problem on carbondata quering performance tuning

Posted by Mick Yuan on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Problem-on-carbondata-quering-performance-tuning-tp44031p44502.html

Hi

Thanks very much for your reply.

For solution A, I only care about querying, not data loading. I changed the resource allocation and confirmed the number of containers running on YARN, but I don't see any improvement for the job.

Solution B doesn't work either.

So I need to give you more details.
The job group consists of five jobs. The first four take only a small part of the total time; each has a single stage with one or two tasks, and all of them are CarbonScan RDD stages.

The fifth job has four stages. The first two are CarbonScan RDD stages: the first has 2 tasks and takes 2s, the second has 39 tasks and takes 1s. The other two are aggregate stages with 216 tasks each; one takes 3s and the other takes 0.9s.

I checked the driver log and found that the time spent between parsing and the statement you gave is less than 1 second.

Thanks.

------------------ Original ------------------
From: "BabuLal"<[hidden email]>;
Date: Tue, Apr 3, 2018 02:25 AM
To: "dev"<[hidden email]>;

Subject: Re: Problem on carbondata quering performance tuning

Hi

Thanks for using Carbondata.

Based on the information you provided, please try the solutions/points below.

*A. Tune Resource Allocation*

You have 55 cores per NM and have set spark.executor.cores=54, which means
each NM can host only one executor, so you will get only 4 executors in
total even though you have set spark.executor.instances to 10.
For query execution we need more executors.
Cluster capacity:
Total NM=4
Core/NM=55
Memory/NM=102 GB

In most cases 12-15 GB of memory per executor is enough. Based on this,
one NM can run 6 executors (102/15, rounded down), so you can configure
the parameters below and try again:

spark.executor.memory 15g
spark.executor.cores 9
spark.executor.instances 24

Please make sure that the YARN RM shows these 24 containers running
(excluding the AM container).
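The arithmetic behind these three values can be sketched as follows. This is only an illustration of the reasoning in this mail: plan_executors and its parameter names are made up for the example and are not part of Spark, and the memory figure is assumed to be in GB.

```python
# Sizing sketch using the cluster figures quoted in this mail.
# plan_executors and its parameter names are illustrative, not a Spark API.

def plan_executors(node_managers, cores_per_nm, memory_gb_per_nm, executor_memory_gb):
    """Return spark.executor.* settings that spread executors over all NodeManagers."""
    # Whole executors that fit into one NodeManager's memory (102 // 15 = 6).
    executors_per_nm = memory_gb_per_nm // executor_memory_gb
    # Split the node's cores evenly between those executors (55 // 6 = 9).
    cores_per_executor = cores_per_nm // executors_per_nm
    return {
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.executor.cores": cores_per_executor,
        "spark.executor.instances": executors_per_nm * node_managers,
    }

plan = plan_executors(node_managers=4, cores_per_nm=55,
                      memory_gb_per_nm=102, executor_memory_gb=15)
for key, value in plan.items():
    print(key, value)
```

The sketch ignores the extra per-container memory overhead that YARN adds on top of spark.executor.memory, so the real headroom per NM is smaller than 102 - 6*15 suggests.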

*B. Table Optimization*
1. Of the 5 tables, yuan_yuan10_STORE_SALES is the big one, with
~1.4 billion records, and its columns SS_SOLD_DATE_SK, SS_ITEM_SK and
SS_CUSTOMER_SK are set as DICTIONARY_INCLUDE. Are any of these
high-cardinality columns? For high-cardinality columns it is better to use
DICTIONARY_EXCLUDE. You can check the size of the Metadata folder in the
carbon store location.

2. ss_sold_date_sk is used in a BETWEEN filter, so it is better to give it
the Int data type.

*C. Information For Next Analysis*

Please provide the details below:
1. Can you check in the Spark UI how much time the CarbonScan RDD stage
took and how much time the aggregate stage took? You can check the DAG, or
send the Spark event files or a Spark UI snapshot.
2. How many tasks are there in each stage?
3. In the driver log, how much time is spent between parsing and the
statement below?
18/04/01 20:49:01 INFO CarbonScanRDD:
Identified no.of.blocks: 1,
no.of.tasks: 1,
no.of.nodes: 0,
parallelism: 1

4. Set enable.query.statistics=true in carbon.properties and send/analyze
the time taken by Carbon on the executor side (such as time spent in IO,
dictionary load, etc.).
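For point 3, the gap between two driver log lines can be measured with a small script. This is a minimal sketch that assumes the log keeps the "yy/MM/dd HH:mm:ss" prefix seen in the CarbonScanRDD line above; the parsing line used here is a hypothetical stand-in, and log_time and seconds_between are illustrative helper names.

```python
from datetime import datetime

# Timestamp layout of the driver log line quoted above: "18/04/01 20:49:01 INFO ...".
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"

def log_time(line):
    """Parse the leading 'yy/MM/dd HH:mm:ss' timestamp (17 characters) of a log line."""
    return datetime.strptime(line[:17], LOG_TIME_FORMAT)

def seconds_between(start_line, end_line):
    """Elapsed seconds between two driver log lines."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()

# The first line is a hypothetical stand-in for the parsing message; only the
# CarbonScanRDD line is taken from the log excerpt in this mail.
parse_line = "18/04/01 20:49:00 INFO SparkSqlParser: Parsing command: <query>"
scan_line = "18/04/01 20:49:01 INFO CarbonScanRDD:"
elapsed = seconds_between(parse_line, scan_line)
print(elapsed)
```

Because these timestamps only have one-second resolution, the result is a whole number of seconds, so a short gap shows up as 0 or 1.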

For data loading: if the data is loaded with Local Sort, then your
configuration is correct (1 node, 1 executor).

Please try Solution A first; it may solve the issue. If the issue still
exists, then provide the information requested in Point C.

Thanks
Babu
