
Re: GC problem and performance refine problem

Posted by Anning Luo-2 on Nov 17, 2016; 8:59am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/GC-problem-and-performance-refine-problem-tp2844p2968.html

Hi Kumar Vishal,



I redid the experiments with detailed driver logging. The logs are attached,
and each log covers the same query run 10 times.



1. "driverlog1" was produced under the same conditions as my previous test.

2. "driverlog2": in this experiment, before loading the data, I shuffled the
CSV rows so that rows with the same attribute values are distributed
uniformly across all CSV files, i.e. if the query filter is applied to the
original CSV files, the result count is almost the same for every file.
Since every CSV file is one block, every partition seen by the CarbonData
loader then holds uniform data. That method works well (a rough sketch of
the shuffling step follows the experiment list below):

Time cost of first experiment:  103543, 49363, 125430, 22138, 21626, 27905, 22086, 21332, 23501, 16715

Time cost of second experiment: 62466, 38709, 24700, 18767, 18325, 15887, 16014, 15452, 17440, 15563

3. "driverlog3": first, use the same shuffled CSV data as in the second
experiment; second, modify the CREATE TABLE SQL:



*OLD CREATE TABLE:*

CREATE TABLE IF NOT EXISTS Table1
(h Int, g Int, d String, f Int, e Int,
a Int, b Int, …(nearly 300 more columns)
)
STORED BY 'org.apache.carbondata.format' TBLPROPERTIES(
"NO_INVERTED_INDEX"="a", "NO_INVERTED_INDEX"="b",
…(nearly 300 more columns)
"DICTIONARY_INCLUDE"="a",
"DICTIONARY_INCLUDE"="b",
…(nearly 300 more columns)
)



*NEW CREATE TABLE:*

CREATE TABLE IF NOT EXISTS Table1
(h Int, g Int, f Int, e Int, d String,
a Int, b Int, …(nearly 300 more columns)
)
STORED BY 'org.apache.carbondata.format' TBLPROPERTIES(
"DICTIONARY_INCLUDE"="h,g,f,e,d"
)

The h, g, f, e, d columns are the ones used in the filter, ordered (roughly)
from fewer to more distinct values.

Time cost of third experiment:  57088, 37105, 41588, 43540, 38143, 35581, 34040, 35886, 34610, 32589
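
For reference, the row shuffling in the second experiment was done roughly
along the lines of the sketch below. This is a minimal Spark snippet; the
paths, the partition count and the assumption that the CSV files carry no
header row are mine, not taken from the actual setup.

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleCsvRows {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-csv-rows"))
    // Read all source CSV files as plain text lines (assumes no header rows).
    val rows = sc.textFile("hdfs:///input/raw_csv/*.csv")
    // repartition() redistributes rows using a random key, so rows sharing
    // the same filter value spread roughly evenly over the output files;
    // each output partition then becomes one block for the CarbonData load.
    rows.repartition(100).saveAsTextFile("hdfs:///input/shuffled_csv")
    sc.stop()
  }
}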

The time cost of the third experiment is longer than the others. That is
because I use "h = 66 and g = 67 and f = 12 and e != 5 and d > 'aaaa' and
d < 'bbbb'" as the filter condition. Given the MDK build order, the matching
rows are contiguous in the file, so some blocklets match the filter
condition completely while others match zero rows. But the scheduler on the
driver does not filter out the zero-match blocklets; it seems the min/max
filter does not work correctly on the driver side when getSplits is called
on the InputFormat. Here are two cases from the executors (blocklet size is
120k rows):

+------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+
|           task_id|load_blocks_time|load_dictionary_time|scan_blocks_time|scan_blocks_num|result_size|total_executor_time|
+------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+
|603288369228186_62|            426 |                  2 |             28 |             3 |         0 |               470 |
+------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+

16/11/16 21:36:25 INFO impl.FilterScanner: pool-30-thread-1 [INDEX] index filter : 113609/120000

+-------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+
|            task_id|load_blocks_time|load_dictionary_time|scan_blocks_time|scan_blocks_num|result_size|total_executor_time|
+-------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+
|603288369228186_428|            380 |                  4 |            661 |             3 |    113609 |             13634 |
+-------------------+----------------+--------------------+----------------+---------------+-----------+-------------------+

In the first case result_size is zero, but the blocklet is still assigned to
a task. That makes some tasks do nothing while others carry a heavy load.

In the second case, only one blocklet (out of the 4 or 6 blocklets in the
task) actually has data. I added a log line for the inverted-index filter
ratio of the form "[INDEX] index filter : <result count>/<total block count>".

So, how can I make the min/max filter work correctly on the driver side?



Another question: repeating a key across multiple entries in TBLPROPERTIES
does not work correctly, e.g. "DICTIONARY_INCLUDE"="a",
"DICTIONARY_INCLUDE"="b" in the old CREATE TABLE SQL. Only the column in the
last DICTIONARY_INCLUDE entry is added as a dictionary dimension. The new
form works correctly, but the old way does not throw any exception either.
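
To make the contrast concrete, here is a minimal sketch (the tiny two-column
table and its name are only placeholders, not my real schema):

-- Old style: the key is repeated. No error is thrown, but only the last
-- occurrence takes effect, so only "b" ends up as a dictionary dimension.
CREATE TABLE IF NOT EXISTS t_old (a Int, b Int)
STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES("DICTIONARY_INCLUDE"="a", "DICTIONARY_INCLUDE"="b")

-- New style: one key with a comma-separated column list; both "a" and "b"
-- become dictionary dimensions as intended.
CREATE TABLE IF NOT EXISTS t_new (a Int, b Int)
STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES("DICTIONARY_INCLUDE"="a,b")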



Further, the second experiment suggests that balancing the data is
important, so I will change the blocklet size to 2k rows once the min/max
filter works on the driver side. I have not yet changed the Int measures to
Double; I will do that later.
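
For the record, the follow-up changes I have in mind look roughly like this.
The blocklet-size knob below is the one I know from the CarbonData docs
(set in carbon.properties, default 120000 rows); please correct me if 0.1.1
uses a different property.

# carbon.properties: rows per blocklet (default 120000). A smaller blocklet
# should let min/max pruning work at a finer granularity and spread the
# surviving rows more evenly across tasks.
carbon.blocklet.size=2000

For the measures, the change is simply declaring them as Double instead of
Int in the CREATE TABLE (e.g. "a Double, b Double, …"), as suggested earlier
in this thread for better value compression.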

> 2016-11-16 21:55 GMT+08:00 Kumar Vishal <[hidden email]>:
>
>> Hi An Lan,
>> Please go through the discussion *how to create the carbon table for
>> better query performance* shared by Bill. I am copying the same in this
>> mail for your reference.
>>
>>
>> Discussion: how to create the CarbonData table with good performance
>>
>> Suggestions for creating a Carbon table
>>
>> Recently we used CarbonData for performance tests in the
>> telecommunication field and summarized some suggestions for creating the
>> CarbonData table. We have tables which range from 10 thousand rows to 10
>> billion rows and have from 100 to 300 columns. Following are some of the
>> columns used in the tables.
>>
>> Column name     Data type       Cardinality     Attribution
>> msisdn          String          30 million      dimension
>> BEGIN_TIME      bigint          10 thousand     dimension
>> HOST            String          1 million       dimension
>> Dime_1          String          1 thousand      dimension
>> counter_1       numeric(20,0)   NA              measure
>> ...             ...             NA              ...
>> counter_100     numeric(20,0)   NA              measure
>>
>> We have more than 50 test cases; based on them we summarize some
>> suggestions for creating a table with better query performance.
>>
>> 1. Put the most frequently used filter column at the beginning. For
>> example, if MSISDN is the filter used in most queries, put MSISDN as the
>> first column. The create table command can be as follows; a query which
>> filters on MSISDN will then perform well (but because MSISDN has high
>> cardinality, creating the table like this will decrease the compression
>> ratio):
>>
>> create table carbondata_table(msisdn String, ...)
>> STORED BY 'org.apache.carbondata.format'
>> TBLPROPERTIES ('DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...');
>>
>> 2. If multiple columns are frequently used in filters, put them at the
>> front, ordered from low cardinality to high cardinality. For example, if
>> msisdn, host and dime_1 are frequently used filter columns, the column
>> order can be dime_1 -> host -> msisdn, because dime_1 has the lowest
>> cardinality. This will increase the compression ratio and give good
>> performance for filters on dime_1, host and msisdn:
>>
>> create table carbondata_table(Dime_1 String, HOST String, MSISDN String, ...)
>> STORED BY 'org.apache.carbondata.format'
>> TBLPROPERTIES ('DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..');
>>
>> 3. If no column is frequently used in filters, put all the dimension
>> columns in order from low cardinality to high cardinality:
>>
>> create table carbondata_table(Dime_1 String, BEGIN_TIME bigint,
>> HOST String, MSISDN String, ...)
>> STORED BY 'org.apache.carbondata.format'
>> TBLPROPERTIES ('DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');
>>
>> 4. For measures that do not need high accuracy, there is no need to use
>> the numeric(20,0) data type; using double instead will improve query
>> performance. In one test case, replacing numeric(20,0) with double
>> improved the query 5 times, from 15 seconds to 3 seconds:
>>
>> create table carbondata_table(Dime_1 String, BEGIN_TIME bigint,
>> HOST String, MSISDN String, counter_1 double, counter_2 double,
>> ... counter_100 double)
>> STORED BY 'org.apache.carbondata.format'
>> TBLPROPERTIES ('DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
>>
>> 5. If a column is always incremental, like start_time (for example, every
>> day we load data into carbon and start_time is incremental for each
>> load), put it at the back of the dimensions, because an always-incremental
>> value can make good use of the min/max index:
>>
>> create table carbondata_table(Dime_1 String, HOST String, MSISDN String,
>> counter_1 double, counter_2 double, BEGIN_TIME bigint,
>> ... counter_100 double)
>> STORED BY 'org.apache.carbondata.format'
>> TBLPROPERTIES ('DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
>>
>> One more point, about whether a dimension needs a dictionary: we suggest
>> not declaring a column as a dictionary column if its cardinality is higher
>> than 50 thousand; a high-cardinality column declared as dictionary will
>> impact the load performance.
>>
>> -Regards
>> Kumar Vishal
>>
>> On Wed, Nov 16, 2016 at 7:19 PM, Kumar Vishal <[hidden email]>
>> wrote:
>>
>> > Hi An Lan,
>> > For detail logging you need to add log4j configuration in
>> > spark-default.conf for both driver and executor.
>> > * spark.driver.extraJavaOption = -Dlog4j.configuration=file:/<f
>> ilepath>*
>> > * spark.executor.extraJavaOption = -Dlog4j.configuration=file:/<f
>> ilepath>*
>> >
>> > Please make sure in* log4j.properties* *log4j.rootCategory* is *INFO*
>> >
>> > -Regards
>> > Kumar Vishal
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Nov 16, 2016 at 2:06 PM, An Lan <[hidden email]> wrote:
>> >
>> >> Hi Kumar Vishal,
>> >>
>> >> Thanks for your suggestion.
>> >> The driver log not contain block distribution log by default. How
>> could I
>> >> open it?
>> >> And how does the order of the dimensions be decided?
>> >>
>> >> 2016-11-16 15:14 GMT+08:00 Kumar Vishal <[hidden email]>:
>> >>
>> >> > Hi An Lan,
>> >> >                 Data is already distributed, in this case may be one
>> >> > blocklet is returning more number of rows and other returning less
>> >> because
>> >> > of this some task will take more time.
>> >> >
>> >> > In driver log block distribution log is not present, so it is not
>> clear
>> >> > whether it is going for block distribution or blocklet distribution.
>> >> > distribution, can you please add the detail driver log.
>> >> >
>> >> > *Some suggestion:*
>> >> >
>> >> > 1. If column is String and you are not applying filter then in create
>> >> > statement add it in dictionary exclude this will avoid lookup in
>> >> > dictionary. If number of String column are less then better add it in
>> >> > dictionary exclude.
>> >> >
>> >> > 2. If column is a numeric column and you are not applying filter
>> then no
>> >> > need to add in dictionary include or exclude, so it will be a measure
>> >> > column.
>> >> >
>> >> > 3. Currently in carbon for measure column if you will create as a
>> double
>> >> > data type it will give more compression as currently value
>> compression
>> >> mode
>> >> > is support for double data type.
>> >> >
>> >> > -Regards
>> >> > Kumar Vishal
>> >> >
>> >> >
>> >> > On Wed, Nov 16, 2016 at 8:12 AM, An Lan <[hidden email]> wrote:
>> >> >
>> >> > > Hi Kumar Vishal,
>> >> > >
>> >> > > 1. I found the quantity of rows filtered out by invert index is not
>> >> > > uniform between different tasks and the difference is large. Some
>> task
>> >> > may
>> >> > > be 3~4k row after filtered, but the longer tasks may be 3~4w. When
>> >> most
>> >> > > longer task on same node, time cost will be more longer than
>> others.
>> >> So,
>> >> > is
>> >> > > there any way to balance the data and make rows distribute uniform
>> >> > between
>> >> > > task?
>> >> > > Now, the blocklet size is still 120k rows. 3k+ blocket in total.
>> Every
>> >> > > task has 6 blocket.
>> >> > > 2. I upload some logs last time. Is there some access problem about
>> >> it?
>> >> > Is
>> >> > > there any suggest or question about it?
>> >> > >
>> >> > > 2016-11-14 11:29 GMT+08:00 An Lan <[hidden email]>:
>> >> > >
>> >> > >> Hi Kumar Vishal,
>> >> > >>
>> >> > >>
>> >> > >> Driver and some executor logs are in the accessory. The same query
>> >> run
>> >> > >> for five times.
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> Time consume for every query:
>> >> > >>
>> >> > >> 67068ms, 45758ms, 26497ms, 22619ms, 21504ms
>> >> > >>
>> >> > >>
>> >> > >> The first stage for every query:
>> >> > >>
>> >> > >> (first query -> start from 0 stage
>> >> > >>
>> >> > >> second query -> 4
>> >> > >>
>> >> > >> third query -> 8
>> >> > >>
>> >> > >> fourth query -> 12
>> >> > >>
>> >> > >> five query -> 16)
>> >> > >>
>> >> > >> see "first stage of queries.png" in accessory
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> Longest task for every query:
>> >> > >>
>> >> > >> (Task and log relationship:
>> >> > >>
>> >> > >> stage 0, task 249 -> 1.log
>> >> > >>
>> >> > >> stage 4, task 420 -> 2.log
>> >> > >>
>> >> > >> stage 8, task 195 -> 3.log
>> >> > >>
>> >> > >> stage 12, task 350 -> 4.log
>> >> > >>
>> >> > >> stage 16, task 321 -> 5.log)
>> >> > >> see "longest task.png" in accessory
>> >> > >>
>> >> > >> ​
>> >> > >>  longest tasks.png
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7dGY4cF9jRDk1QTg/
>> >> > view?usp=drive_web>
>> >> > >> ​​
>> >> > >>  first stage of queries.png
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7SVZsNnR2VEw2X0k/
>> >> > view?usp=drive_web>
>> >> > >> ​​
>> >> > >>  dirverlog
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7VllnVkNnenhyTTA/
>> >> > view?usp=drive_web>
>> >> > >> ​​
>> >> > >>  1.log
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7eUdUWmlia1J4bk0/
>> >> > view?usp=drive_web>
>> >> > >> ​​
>> >> > >>  2.log
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7ZUlGZHZVQ3phdmM/
>> >> > view?usp=drive_web>
>> >> > >> ​​
>> >> > >>  3.log
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7UHpBQzREX3N5aEk/
>> >> > view?usp=drive_web>
>> >> > >> ​​
>> >> > >>  4.log
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7NVctMmYwNldCVEk/
>> >> > view?usp=drive_web>
>> >> > >> ​​
>> >> > >>  5.log
>> >> > >> <https://drive.google.com/file/d/0B1XM6KeI1nB7ODhHNE5sSGNfOVE/
>> >> > view?usp=drive_web>
>> >> > >> ​
>> >> > >>
>> >> > >> 2016-11-14 11:22 GMT+08:00 An Lan <[hidden email]>:
>> >> > >>
>> >> > >>> Hi,
>> >> > >>>
>> >> > >>>
>> >> > >>>
>> >> > >>> Driver and some executor logs are in the accessory. The same
>> query
>> >> run
>> >> > >>> for five times.
>> >> > >>>
>> >> > >>>
>> >> > >>>
>> >> > >>> Time consume for every query:
>> >> > >>>
>> >> > >>> 67068ms, 45758ms, 26497ms, 22619ms, 21504ms
>> >> > >>>
>> >> > >>>
>> >> > >>>
>> >> > >>> The first stage for every query:
>> >> > >>>
>> >> > >>> (first query -> start from 0 stage
>> >> > >>>
>> >> > >>> ​second query -> 4
>> >> > >>>
>> >> > >>> third query -> 8
>> >> > >>>
>> >> > >>> fourth query -> 12
>> >> > >>>
>> >> > >>> five query -> 16)
>> >> > >>>
>> >> > >>>
>> >> > >>> Longest task for every query:
>> >> > >>>
>> >> > >>> (Task and log relationship:
>> >> > >>>
>> >> > >>> stage 0, task 249 -> 1.log
>> >> > >>>
>> >> > >>> stage 4, task 420 -> 2.log
>> >> > >>>
>> >> > >>> stage 8, task 195 -> 3.log
>> >> > >>>
>> >> > >>> stage 12, task 350 -> 4.log
>> >> > >>>
>> >> > >>> stage 16, task 321 -> 5.log)
>> >> > >>>
>> >> > >>>
>> >> > >>>
>> >> > >>> 2016-11-11 20:25 GMT+08:00 Kumar Vishal <
>> [hidden email]
>> >> >:
>> >> > >>>
>> >> > >>>> What is the time difference between first time and second time
>> >> query
>> >> > as
>> >> > >>>> second time it will read from os cache,so i think there wont be
>> >> any IO
>> >> > >>>> bottleneck.
>> >> > >>>>
>> >> > >>>> Can u provide driver log and executor log for task which are
>> taking
>> >> > more
>> >> > >>>> time.
>> >> > >>>>
>> >> > >>>>
>> >> > >>>> -Regards
>> >> > >>>> Kumar Vishal
>> >> > >>>>
>> >> > >>>> On Fri, Nov 11, 2016 at 3:21 PM, An Lan <[hidden email]>
>> >> wrote:
>> >> > >>>>
>> >> > >>>> > 1. I have set --num-executors=100 in all test. The image write
>> >> 100
>> >> > >>>> nodes
>> >> > >>>> > means one executor for one node, and when there is 64 node,
>> there
>> >> > >>>> maybe two
>> >> > >>>> > or three executor on one node. The total executor number is
>> >> always
>> >> > >>>> 100.
>> >> > >>>> >
>> >> > >>>> > 2. I do not find the property enable.blocklet.distribution in
>> >> > 0.1.1. I
>> >> > >>>> > found it in 0.2.0. I am testing on 0.1.1, and will try it
>> later.
>> >> If
>> >> > >>>> you
>> >> > >>>> > concern locality level, it seems that the most long query is
>> not
>> >> > >>>> caused by
>> >> > >>>> > IO problem. Statistic of the long task statistic like this:
>> >> > >>>> >
>> >> > >>>> > +--------------------+----------------+--------------------+
>> >> > >>>> > ----------------+---------------+-----------+---------------
>> >> ----+
>> >> > >>>> > |             task_id|load_blocks_time|load
>> >> > >>>> _dictionary_time|scan_blocks_
>> >> > >>>> > time|scan_blocks_num|result_size|total_executor_time|
>> >> > >>>> > +--------------------+----------------+--------------------+
>> >> > >>>> > ----------------+---------------+-----------+---------------
>> >> ----+
>> >> > >>>> > |4199754456147478_134|              1 |                 32 |
>> >> > >>>> > 3306 |             3 |     42104 |             13029 |
>> >> > >>>> > +--------------------+----------------+--------------------+
>> >> > >>>> > ----------------+---------------+-----------+---------------
>> >> ----+
>> >> > >>>> >
>> >> > >>>> > I have suffer from IO, but after configure speculation, most
>> >> tasks
>> >> > >>>> > with a long IO will be resend.
>> >> > >>>> >
>> >> > >>>> >
>> >> > >>>> > 3. distinct value:
>> >> > >>>> >
>> >> > >>>> > g: 11
>> >> > >>>> >
>> >> > >>>> > h: 3
>> >> > >>>> >
>> >> > >>>> > f: 7
>> >> > >>>> >
>> >> > >>>> > e: 3
>> >> > >>>> >
>> >> > >>>> > d: 281
>> >> > >>>> >
>> >> > >>>> > 4. In the query, only e, f ,g, h, d and A are used in filter,
>> >> others
>> >> > >>>> are
>> >> > >>>> > not used. So I think others used in the aggregation are no
>> need
>> >> > added
>> >> > >>>> for
>> >> > >>>> > index.
>> >> > >>>> >
>> >> > >>>> > 5. I have write the e, f, g, h, d in most left just after
>> "create
>> >> > >>>> table..."
>> >> > >>>> >
>> >> > >>>> > 2016-11-11 15:08 GMT+08:00 Kumar Vishal <
>> >> [hidden email]
>> >> > >:
>> >> > >>>> >
>> >> > >>>> > > Hi An Lan,
>> >> > >>>> > >
>> >> > >>>> > > Please confirm below things.
>> >> > >>>> > >
>> >> > >>>> > > 1. Is dynamic executor is enabled?? If it is enabled can u
>> >> > disabled
>> >> > >>>> and
>> >> > >>>> > > try(this is to check is there any impact with dynamic
>> executor)
>> >> > >>>> > >     for disabling dynamic executor you need set in
>> >> > >>>> spark-default.conf
>> >> > >>>> > > --num-executors=100
>> >> > >>>> > >
>> >> > >>>> > > 2. Can u please set in below property
>> >> > enable.blocklet.distribution=f
>> >> > >>>> alse
>> >> > >>>> > > and execute the query.
>> >> > >>>> > >
>> >> > >>>> > > 3. Cardinality of each column.
>> >> > >>>> > >
>> >> > >>>> > > 4. Any reason why you are setting "NO_INVERTED_INDEX”=“a” ??
>> >> > >>>> > >
>> >> > >>>> > > 5. Can u keep all the columns which is present in filter on
>> >> left
>> >> > >>>> side so
>> >> > >>>> > > less no of blocks will identified during block pruning and
>> it
>> >> will
>> >> > >>>> > improve
>> >> > >>>> > > the query performance.
>> >> > >>>> > >
>> >> > >>>> > >
>> >> > >>>> > > -Regards
>> >> > >>>> > > Kumar Vishal
>> >> > >>>> > >
>> >> > >>>> > > On Fri, Nov 11, 2016 at 11:59 AM, An Lan <
>> [hidden email]>
>> >> > >>>> wrote:
>> >> > >>>> > >
>> >> > >>>> > > > Hi Kumar Vishal,
>> >> > >>>> > > >
>> >> > >>>> > > > 1.       Create table ddl:
>> >> > >>>> > > >
>> >> > >>>> > > > CREATE TABLE IF NOT EXISTS Table1
>> >> > >>>> > > >
>> >> > >>>> > > > (* h Int, g Int, d String, f Int, e Int,*
>> >> > >>>> > > >
>> >> > >>>> > > > a Int, b Int, …(extra near 300 columns)
>> >> > >>>> > > >
>> >> > >>>> > > > STORED BY 'org.apache.carbondata.format'
>> >> > >>>> > > > TBLPROPERTIES(
>> >> > >>>> > > >
>> >> > >>>> > > > "NO_INVERTED_INDEX”=“a”,
>> >> > >>>> > > > "NO_INVERTED_INDEX”=“b”,
>> >> > >>>> > > >
>> >> > >>>> > > > …(extra near 300 columns)
>> >> > >>>> > > >
>> >> > >>>> > > > "DICTIONARY_INCLUDE”=“a”,
>> >> > >>>> > > >
>> >> > >>>> > > > "DICTIONARY_INCLUDE”=“b”,
>> >> > >>>> > > >
>> >> > >>>> > > > …(extra near 300 columns)
>> >> > >>>> > > >
>> >> > >>>> > > > )
>> >> > >>>> > > >
>> >> > >>>> > > > 2.       3. There more than hundreds node in the cluster,
>> but
>> >> > >>>> cluster
>> >> > >>>> > is
>> >> > >>>> > > > used mixed with other application.  Some time when node is
>> >> > >>>> enough, we
>> >> > >>>> > > will
>> >> > >>>> > > > get 100 distinct node.
>> >> > >>>> > > >
>> >> > >>>> > > > 4.      I give a statistic of task time during once query
>> and
>> >> > mark
>> >> > >>>> > > > distinct nodes below:
>> >> > >>>> > > >
>> >> > >>>> > > > [image: 内嵌图片 1]
>> >> > >>>> > > >
>> >> > >>>> > > >
>> >> > >>>> > > >
>> >> > >>>> > > >
>> >> > >>>> > > > 2016-11-10 23:52 GMT+08:00 Kumar Vishal <
>> >> > >>>> [hidden email]>:
>> >> > >>>> > > >
>> >> > >>>> > > >> Hi Anning Luo,
>> >> > >>>> > > >>
>> >> > >>>> > > >> Can u please provide below details.
>> >> > >>>> > > >>
>> >> > >>>> > > >> 1.Create table ddl.
>> >> > >>>> > > >> 2.Number of node in you cluster setup.
>> >> > >>>> > > >> 3. Number of executors per node.
>> >> > >>>> > > >> 4. Query statistics.
>> >> > >>>> > > >>
>> >> > >>>> > > >> Please find my comments in bold.
>> >> > >>>> > > >>
>> >> > >>>> > > >> Problem:
>> >> > >>>> > > >> 1.                          GC problem. We suffer a
>> 20%~30%
>> >> GC
>> >> > >>>> time
>> >> > >>>> > for
>> >> > >>>> > > >> some task in first stage after a lot of parameter
>> >> refinement.
>> >> > We
>> >> > >>>> now
>> >> > >>>> > use
>> >> > >>>> > > >> G1
>> >> > >>>> > > >> GC in java8. GC time will double if use CMS. The main GC
>> >> time
>> >> > is
>> >> > >>>> spent
>> >> > >>>> > > on
>> >> > >>>> > > >> young generation GC. Almost half memory of young
>> generation
>> >> > will
>> >> > >>>> be
>> >> > >>>> > copy
>> >> > >>>> > > >> to
>> >> > >>>> > > >> old generation. It seems lots of object has a long life
>> >> than GC
>> >> > >>>> period
>> >> > >>>> > > and
>> >> > >>>> > > >> the space is not be reuse(as concurrent GC will release
>> it
>> >> > >>>> later).
>> >> > >>>> > When
>> >> > >>>> > > we
>> >> > >>>> > > >> use a large Eden(>=1G for example), once GC time will be
>> >> > >>>> seconds. If
>> >> > >>>> > set
>> >> > >>>> > > >> Eden little(256M for example), once GC time will be
>> hundreds
>> >> > >>>> > > milliseconds,
>> >> > >>>> > > >> but more frequency and total is still seconds. Is there
>> any
>> >> way
>> >> > >>>> to
>> >> > >>>> > > lessen
>> >> > >>>> > > >> the GC time? (We don’t consider the first query and
>> second
>> >> > query
>> >> > >>>> in
>> >> > >>>> > this
>> >> > >>>> > > >> case.)
>> >> > >>>> > > >>
>> >> > >>>> > > >> *How many node are present in your cluster setup?? If
>> nodes
>> >> are
>> >> > >>>> less
>> >> > >>>> > > >> please
>> >> > >>>> > > >> reduce the number of executors per node.*
>> >> > >>>> > > >>
>> >> > >>>> > > >> 2.                          Performance refine problem.
>> Row
>> >> > >>>> number
>> >> > >>>> > after
>> >> > >>>> > > >> being filtered is not uniform. Some node maybe heavy. It
>> >> spend
>> >> > >>>> more
>> >> > >>>> > time
>> >> > >>>> > > >> than other node. The time of one task is 4s ~ 16s. Is any
>> >> > method
>> >> > >>>> to
>> >> > >>>> > > refine
>> >> > >>>> > > >> it?
>> >> > >>>> > > >>
>> >> > >>>> > > >> 3.                          Too long time for first and
>> >> second
>> >> > >>>> query.
>> >> > >>>> > I
>> >> > >>>> > > >> know dictionary and some index need to be loaded for the
>> >> first
>> >> > >>>> time.
>> >> > >>>> > But
>> >> > >>>> > > >> after I trying use query below to preheat it, it still
>> >> spend a
>> >> > >>>> lot of
>> >> > >>>> > > >> time.
>> >> > >>>> > > >> How could I preheat the query correctly?
>> >> > >>>> > > >>                         select Aarray, a, b, c… from
>> Table1
>> >> > where
>> >> > >>>> > Aarray
>> >> > >>>> > > >> is
>> >> > >>>> > > >> not null and d = “sss” and e !=22 and f = 33 and g = 44
>> and
>> >> h =
>> >> > >>>> 55
>> >> > >>>> > > >>
>> >> > >>>> > > >> *Currently we are working on first time query
>> improvement.
>> >> For
>> >> > >>>> now you
>> >> > >>>> > > can
>> >> > >>>> > > >> run select count(*) or count(column), so all the blocks
>> get
>> >> > >>>> loaded and
>> >> > >>>> > > >> then
>> >> > >>>> > > >> you can run the actual query.*
>> >> > >>>> > > >>
>> >> > >>>> > > >>
>> >> > >>>> > > >> 4.             Any other suggestion to lessen the query
>> >> time?
>> >> > >>>> > > >>
>> >> > >>>> > > >>
>> >> > >>>> > > >> Some suggestion:
>> >> > >>>> > > >>             The log by class QueryStatisticsRecorder give
>> >> me a
>> >> > >>>> good
>> >> > >>>> > > means
>> >> > >>>> > > >> to find the neck bottle, but not enough. There still some
>> >> > metric
>> >> > >>>> I
>> >> > >>>> > think
>> >> > >>>> > > >> is
>> >> > >>>> > > >> very useful:
>> >> > >>>> > > >>             1. filter ratio. i.e.. not only result_size
>> but
>> >> > also
>> >> > >>>> the
>> >> > >>>> > > >> origin
>> >> > >>>> > > >> size so we could know how many data is filtered.
>> >> > >>>> > > >>             2. IO time. The scan_blocks_time is not
>> enough.
>> >> If
>> >> > >>>> it is
>> >> > >>>> > > high,
>> >> > >>>> > > >> we know somethings wrong, but not know what cause that
>> >> problem.
>> >> > >>>> The
>> >> > >>>> > real
>> >> > >>>> > > >> IO
>> >> > >>>> > > >> time for data is not be provided. As there may be several
>> >> file
>> >> > >>>> for one
>> >> > >>>> > > >> partition, know the program slow is caused by datanode or
>> >> > >>>> executor
>> >> > >>>> > > itself
>> >> > >>>> > > >> give us intuition to find the problem.
>> >> > >>>> > > >>             3. The TableBlockInfo for task. I log it by
>> >> myself
>> >> > >>>> when
>> >> > >>>> > > >> debugging. It tell me how many blocklets is locality. The
>> >> spark
>> >> > >>>> web
>> >> > >>>> > > >> monitor
>> >> > >>>> > > >> just give a locality level, but may be only one blocklet
>> is
>> >> > >>>> locality.
>> >> > >>>> > > >>
>> >> > >>>> > > >>
>> >> > >>>> > > >> -Regards
>> >> > >>>> > > >> Kumar Vishal
>> >> > >>>> > > >>
>> >> > >>>> > > >> On Thu, Nov 10, 2016 at 8:55 PM, An Lan <
>> [hidden email]
>> >> >
>> >> > >>>> wrote:
>> >> > >>>> > > >>
>> >> > >>>> > > >> > Hi,
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > We are using carbondata to build our table and running
>> >> query
>> >> > in
>> >> > >>>> > > >> > CarbonContext. We have some performance problem during
>> >> > >>>> refining the
>> >> > >>>> > > >> system.
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *Background*:
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *cluster*:                      100 executor,5
>> >> task/executor,
>> >> > >>>> 10G
>> >> > >>>> > > >> > memory/executor
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *data*:                              60+GB(per one
>> >> replica)
>> >> > as
>> >> > >>>> > carbon
>> >> > >>>> > > >> data
>> >> > >>>> > > >> > format, 600+MB/file * 100 file, 300+columns,
>> 300+million
>> >> rows
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *sql example:*
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >                                       select A,
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >                                       sum(a),
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >                                       sum(b),
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >                                       sum(c),
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >                                       …( extra 100
>> >> > aggregation
>> >> > >>>> like
>> >> > >>>> > > >> > sum(column))
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >                                       from Table1
>> LATERAL
>> >> > VIEW
>> >> > >>>> > > >> > explode(split(Aarray, ‘*;*’)) ATable AS A
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >                                       where A is not
>> null
>> >> and
>> >> > >>>> d >
>> >> > >>>> > > >> “ab:c-10”
>> >> > >>>> > > >> > and d < “h:0f3s” and e!=10 and f=22 and g=33 and h=44
>> >> GROUP
>> >> > BY
>> >> > >>>> A
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *target query time*:           <10s
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *current query time*:         15s ~ 25s
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *scene:*                             OLAP system. <100
>> >> > queries
>> >> > >>>> every
>> >> > >>>> > > >> day.
>> >> > >>>> > > >> > Concurrency number is not high. Most time cpu is idle,
>> so
>> >> > this
>> >> > >>>> > service
>> >> > >>>> > > >> will
>> >> > >>>> > > >> > run with other program. The service will run for long
>> >> time.
>> >> > We
>> >> > >>>> could
>> >> > >>>> > > not
>> >> > >>>> > > >> > occupy a very large memory for every executor.
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *refine*:                              I have build
>> index
>> >> and
>> >> > >>>> > > >> dictionary on
>> >> > >>>> > > >> > d, e, f, g, h and build dictionary on all other
>> >> aggregation
>> >> > >>>> > > >> columns(i.e. a,
>> >> > >>>> > > >> > b, c, …100+ columns). And make sure there is one
>> segment
>> >> for
>> >> > >>>> total
>> >> > >>>> > > >> data. I
>> >> > >>>> > > >> > have open the speculation(quantile=0.5, interval=250,
>> >> > >>>> > multiplier=1.2).
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > Time is mainly spent on first stage before shuffling.
>> As
>> >> 95%
>> >> > >>>> data
>> >> > >>>> > will
>> >> > >>>> > > >> be
>> >> > >>>> > > >> > filtered out, the shuffle process spend little time. In
>> >> first
>> >> > >>>> stage,
>> >> > >>>> > > >> most
>> >> > >>>> > > >> > task complete in less than 10s. But there still be
>> near 50
>> >> > >>>> tasks
>> >> > >>>> > > longer
>> >> > >>>> > > >> > than 10s. Max task time in one query may be 12~16s.
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *Problem:*
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > 1.          GC problem. We suffer a 20%~30% GC time for
>> >> some
>> >> > >>>> task in
>> >> > >>>> > > >> first
>> >> > >>>> > > >> > stage after a lot of parameter refinement. We now use
>> G1
>> >> GC
>> >> > in
>> >> > >>>> > java8.
>> >> > >>>> > > GC
>> >> > >>>> > > >> > time will double if use CMS. The main GC time is spent
>> on
>> >> > young
>> >> > >>>> > > >> generation
>> >> > >>>> > > >> > GC. Almost half memory of young generation will be
>> copy to
>> >> > old
>> >> > >>>> > > >> generation.
>> >> > >>>> > > >> > It seems lots of object has a long life than GC period
>> and
>> >> > the
>> >> > >>>> space
>> >> > >>>> > > is
>> >> > >>>> > > >> not
>> >> > >>>> > > >> > be reuse(as concurrent GC will release it later). When
>> we
>> >> > use a
>> >> > >>>> > large
>> >> > >>>> > > >> > Eden(>=1G for example), once GC time will be seconds.
>> If
>> >> set
>> >> > >>>> Eden
>> >> > >>>> > > >> > little(256M for example), once GC time will be hundreds
>> >> > >>>> > milliseconds,
>> >> > >>>> > > >> but
>> >> > >>>> > > >> > more frequency and total is still seconds. Is there any
>> >> way
>> >> > to
>> >> > >>>> > lessen
>> >> > >>>> > > >> the
>> >> > >>>> > > >> > GC time? (We don’t consider the first query and second
>> >> query
>> >> > >>>> in this
>> >> > >>>> > > >> case.)
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > 2.           Performance refine problem. Row number
>> after
>> >> > being
>> >> > >>>> > > >> filtered is
>> >> > >>>> > > >> > not uniform. Some node maybe heavy. It spend more time
>> >> than
>> >> > >>>> other
>> >> > >>>> > > node.
>> >> > >>>> > > >> The
>> >> > >>>> > > >> > time of one task is 4s ~ 16s. Is any method to refine
>> it?
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > 3.           Too long time for first and second query.
>> I
>> >> know
>> >> > >>>> > > dictionary
>> >> > >>>> > > >> > and some index need to be loaded for the first time.
>> But
>> >> > after
>> >> > >>>> I
>> >> > >>>> > > trying
>> >> > >>>> > > >> use
>> >> > >>>> > > >> > query below to preheat it, it still spend a lot of
>> time.
>> >> How
>> >> > >>>> could I
>> >> > >>>> > > >> > preheat the query correctly?
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >               select Aarray, a, b, c… from Table1 where
>> >> > Aarray
>> >> > >>>> is
>> >> > >>>> > not
>> >> > >>>> > > >> null
>> >> > >>>> > > >> > and d = “sss” and e !=22 and f = 33 and g = 44 and h =
>> 55
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > 4.           Any other suggestion to lessen the query
>> >> time?
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > Some suggestion:
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >             The log by class QueryStatisticsRecorder
>> give
>> >> me
>> >> > a
>> >> > >>>> good
>> >> > >>>> > > >> means
>> >> > >>>> > > >> > to find the neck bottle, but not enough. There still
>> some
>> >> > >>>> metric I
>> >> > >>>> > > >> think is
>> >> > >>>> > > >> > very useful:
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >             1. filter ratio. i.e.. not only result_size
>> >> but
>> >> > >>>> also the
>> >> > >>>> > > >> origin
>> >> > >>>> > > >> > size so we could know how many data is filtered.
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >             2. IO time. The scan_blocks_time is not
>> >> enough.
>> >> > If
>> >> > >>>> it is
>> >> > >>>> > > >> high,
>> >> > >>>> > > >> > we know somethings wrong, but not know what cause that
>> >> > >>>> problem. The
>> >> > >>>> > > >> real IO
>> >> > >>>> > > >> > time for data is not be provided. As there may be
>> several
>> >> > file
>> >> > >>>> for
>> >> > >>>> > one
>> >> > >>>> > > >> > partition, know the program slow is caused by datanode
>> or
>> >> > >>>> executor
>> >> > >>>> > > >> itself
>> >> > >>>> > > >> > give us intuition to find the problem.
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >             3. The TableBlockInfo for task. I log it by
>> >> > myself
>> >> > >>>> when
>> >> > >>>> > > >> > debugging. It tell me how many blocklets is locality.
>> The
>> >> > >>>> spark web
>> >> > >>>> > > >> monitor
>> >> > >>>> > > >> > just give a locality level, but may be only one
>> blocklet
>> >> is
>> >> > >>>> > locality.
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > ---------------------
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > Anning Luo
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > *HULU*
>> >> > >>>> > > >> >
>> >> > >>>> > > >> > Email: [hidden email]
>> >> > >>>> > > >> >
>> >> > >>>> > > >> >             [hidden email]
>> >> > >>>> > > >> >
>> >> > >>>> > > >>
>> >> > >>>> > > >
>> >> > >>>> > > >
>> >> > >>>> > >
>> >> > >>>> >
>> >> > >>>>
>> >> > >>>
>> >> > >>>
>> >> > >>
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
 driverlog3.txt
<https://drive.google.com/file/d/0B1XM6KeI1nB7QmZXd2d3ZlR2bnM/view?usp=drive_web>
 driverlog2.txt
<https://drive.google.com/file/d/0B1XM6KeI1nB7UF92RWVOWUdwZGM/view?usp=drive_web>
 driverlog1.txt
<https://drive.google.com/file/d/0B1XM6KeI1nB7WjhuNjZHZWZJZGM/view?usp=drive_web>