http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/how-to-make-carbon-run-faster-tp5305p5333.html
> Hi
>
> Thanks for trying out the Apache CarbonData project.
>
> There may be various reasons for this test result. I assume you made a
> time-based partition for the ORC data, right?
> 1. How many rows of data does the SQL query return?
>
> 2. Please try more SQL queries, for example: select * from test_carbon where
> status = xx (with a specific value). This example filters on the leftmost
> column, so it checks the effectiveness of the index.
>
> 3. How many machines (nodes) did you use? Each executor builds one B+ tree
> index, so to fully utilize the index, please reduce the number of
> executors. Suggestion: launch one executor per machine/node (and
> increase that executor's memory).
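The one-executor-per-node suggestion can be sketched as a spark-shell invocation like the one below (for the 40-node setup described in the quoted message). The core and memory figures are illustrative assumptions, not tested values:

```shell
# One executor per node: with 40 nodes, request 40 executors, each with
# several cores and more memory, so the single B+ tree index built per
# executor serves all tasks on that node. Figures are illustrative.
$SPARK_HOME/bin/spark-shell --verbose --name "test" \
  --master yarn-client \
  --driver-memory 10G \
  --num-executors 40 \
  --executor-cores 4 \
  --executor-memory 16G
```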
>
> Regards
> Liang
>
>
> geda wrote
> > Hello:
> > I tested the same data with the same SQL in two formats: 1. CarbonData,
> > 2. Hive ORC. The CarbonData format runs slower than ORC.
> > I created the CarbonData table with the index column order shown in the
> > CREATE TABLE below.
> > Hive SQL (dt is the partition directory):
> > select count(1) as total, status, d_id from test_orc where status != 17 and
> > v_id in (91532, 91533, 91534, 91535, 91536, 91537, 10001) and dt >=
> > '2016-11-01' and dt <= '2016-12-26' group by status, d_id order by total
> > desc
> > CarbonData SQL (create_time is of timestamp type):
> >
> > select count(1) as total, status, d_id from test_carbon where status != 17
> > and v_id in (91532, 91533, 91534, 91535, 91536, 91537, 10001) and
> > date(create_time) >= '2016-11-01' and date(create_time) <= '2016-12-26'
> > group by status, d_id order by total desc
> >
> > The CarbonData table was created like this:
> > CREATE TABLE test_carbon ( status int, v_id bigint, d_id bigint,
> > create_time timestamp
> > ...
> > ...
> > 'DICTIONARY_INCLUDE'='status,d_id,v_id,create_time')
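For context, the elided DDL above might expand to something like the sketch below. The STORED BY 'carbondata' clause and TBLPROPERTIES placement follow the CarbonData 0.x DDL conventions; only the column list and the DICTIONARY_INCLUDE property are taken from the quoted message, the rest is assumed:

```sql
CREATE TABLE IF NOT EXISTS test_carbon (
  status INT,
  v_id BIGINT,
  d_id BIGINT,
  create_time TIMESTAMP
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_INCLUDE'='status,d_id,v_id,create_time')
```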
> >
> > Environment: spark-shell on 40 nodes, Spark 1.6.1, CarbonData 0.20,
> > Hadoop 2.6.3.
> > Data: 2 months (60 days), about 300,000 rows per day, about 600 MB of CSV
> > per day.
> > $SPARK_HOME/bin/spark-shell --verbose --name "test" --master
> > yarn-client --driver-memory 10G --executor-memory 16G --num-executors
> > 40 --executor-cores 1
> > I tested many cases:
> > 1. GC tuning: there is no full GC.
> > 2. spark.sql.shuffle.partitions: all tasks run at the same time.
> > 3. carbon.conf setting:
> > enable.blocklet.distribution=true
> > I use the following code to measure SQL run time in milliseconds:
> > val start = System.nanoTime()
> > body
> > (System.nanoTime() - start) / 1000 / 1000
> >
> > where body is sqlContext.sql(query).show().
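The ad-hoc timing above can be wrapped in a small reusable helper; this is a sketch in plain Scala (no Spark required). One caveat worth checking when comparing formats: .show() prints only the first 20 rows by default, so timing a .count() or .collect() on the result may reflect the end-to-end cost more faithfully.

```scala
// Sketch of a reusable timing helper: runs `body` once and returns
// its result together with the elapsed wall-clock time in milliseconds.
def timeMs[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body // force evaluation of the by-name block
  (result, (System.nanoTime() - start) / 1000000L)
}

// In spark-shell this could be used as (sqlContext assumed in scope):
//   val (_, ms) = timeMs { sqlContext.sql(query).show() }
val (sum, ms) = timeMs { (1L to 1000000L).sum }
println(s"sum=$sum computed in ${ms} ms")
```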
> > I find that ORC returns results faster than CarbonData.
> >
> > In the Spark UI, CarbonData and ORC sometimes take about the same time (I
> > think CarbonData should be faster because of its index, unless a
> > sequential scan is faster than an index scan here), but ORC is more
> > stable. The UI shows 5 s spent, but the measured return time is 8 s for
> > ORC and 12 s for CarbonData (I don't know how to find where the time is
> > spent).
> >
> > Here are some screenshots (from many runs).
> > CarbonData runs:
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run1.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run2.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run1.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run2.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-run2.png>
> > ORC runs:
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-job-total-run1.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-total-run1.png>
> >
> >
> > So my questions are:
> > 1. In spark-shell with sql.show(), why do ORC queries return faster than
> > CarbonData?
> > 2. In the Spark UI, CarbonData should use its index to skip more data, yet
> > its scan tasks take anywhere from 0.2 s to 2 s to 4 s. How can I make the
> > slowest tasks faster?
> > 3. This SQL filters on the leftmost index column, so I expected it to run
> > faster than the ORC test in this case, but it does not. Why?
> > 4. If the answer to question 3 is that my data is too small, does that
> > mean a sequential read is faster than an index scan here?
> >
> > Sorry for my poor English. Any help appreciated, thanks!
>
>
>
>
>
> --
> View this message in context:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/how-to-make-carbon-run-faster-tp5305p5322.html
> Sent from the Apache CarbonData Mailing List archive at Nabble.com.
>