Posted by BabuLal on Apr 02, 2018; 6:25pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Problem-on-carbondata-quering-performance-tuning-tp44031p44120.html
Hi,
Thanks for using CarbonData.
Based on the information you provided, please try the solutions/points below.
*A. Tune Resource Allocation*
You have 55 cores per NM and have set spark.executor.cores=54, which means
each NM will host only one executor, so you get only 4 executors in total
even though spark.executor.instances is set to 10.
For query execution we need more executors.
Cluster capacity:
Total NMs = 4
Cores/NM = 55
Memory/NM = 102 GB
In most cases 12-15 GB of memory per executor is enough. Based on that, one
NM can host about 6 executors (102/15), so you can configure the parameters
below and try again:
spark.executor.memory 15g
spark.executor.cores 9
spark.executor.instances 24
Please make sure that the Yarn RM shows these 24 containers running
(excluding the AM container).
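To make this concrete, here is a minimal Scala sketch of applying these
values when the session is created; the app name is just a placeholder, and
the same settings can equally go into spark-defaults.conf or be passed as
--conf options to spark-submit.

    import org.apache.spark.sql.SparkSession

    // Sketch only: applies the executor sizing suggested above.
    // On YARN these must be set before the SparkContext is created.
    val spark = SparkSession.builder()
      .appName("carbondata-query-tuning")        // placeholder name
      .config("spark.executor.memory", "15g")    // ~102 GB / 6 executors per NM
      .config("spark.executor.cores", "9")       // ~55 cores / 6 executors per NM
      .config("spark.executor.instances", "24")  // 6 executors x 4 NMs
      .getOrCreate()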
*B. Table Optimization*
1. Out of the 5 tables, yuan_yuan10_STORE_SALES is the big one, with ~1.4
billion records, and it has the columns SS_SOLD_DATE_SK, SS_ITEM_SK,
SS_CUSTOMER_SK as DICTIONARY_INCLUDE. Are any of these columns high
cardinality? High-cardinality columns are better kept as DICTIONARY_EXCLUDE;
you can check the size of the Metadata folder in the carbon store location.
2. ss_sold_date_sk has a BETWEEN filter, so it is better to give it the Int
data type (see the sketch after this list).
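For illustration only, a Scala/SQL sketch of how the table definition could
look, assuming for the sake of the example that SS_CUSTOMER_SK turns out to
be the high-cardinality column; only the three columns mentioned above are
shown, the rest of the schema depends on your data.

    // Hypothetical DDL sketch (CarbonData 1.x syntax); other columns omitted.
    spark.sql("""
      CREATE TABLE yuan_yuan10_STORE_SALES (
        ss_sold_date_sk INT,
        ss_item_sk      INT,
        ss_customer_sk  INT
        -- ... remaining columns of the table ...
      )
      STORED BY 'carbondata'
      TBLPROPERTIES (
        'DICTIONARY_INCLUDE'='ss_sold_date_sk,ss_item_sk',
        'DICTIONARY_EXCLUDE'='ss_customer_sk'
      )
    """)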
*C. Information For Next Analysis*
Please provide the details below:
1. Can you check the Spark UI to see how much time the CarbonScanRDD stage
took and how much time the aggregate stage took? You can check the DAG, or
send the Spark event files or a Spark UI snapshot.
2. How many tasks are there for each stage?
3. In the driver, how much time is spent between parsing and the statement
below?
18/04/01 20:49:01 INFO CarbonScanRDD:
Identified no.of.blocks: 1,
no.of.tasks: 1,
no.of.nodes: 0,
parallelism: 1
4. Configure enable.query.statistics=true in carbon.properties and
send/analyze the time taken by Carbon on the executor side (e.g. time spent
in IO, dictionary load, etc.); see the sketch after this list.
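The usual place for this setting is the carbon.properties file, but as a
sketch it can also be set programmatically through the CarbonProperties API
(assuming a 1.x-era build):

    import org.apache.carbondata.core.util.CarbonProperties

    // Sketch: equivalent to adding enable.query.statistics=true
    // in carbon.properties.
    CarbonProperties.getInstance()
      .addProperty("enable.query.statistics", "true")

With this enabled, the query statistics logged on the executor side (scan,
IO, dictionary load times, etc.) can be sent back for analysis.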
For data loading: if data is loaded with Local Sort, then your configuration
is correct (1 node, 1 executor).
Please try Solution A first; it may solve the issue. If it still exists,
then provide the information requested in Point C.
Thanks
Babu