
Re: how to make carbon run faster

Posted by geda on Jan 02, 2017; 4:06am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/how-to-make-carbon-run-faster-tp5305p5333.html

1. Can you tell how many rows of data the SQL returned?

For that SQL, most of the ids are related, so the result is small: about 10~20 rows are returned.

2. You can try more SQL queries, for example: select * from test_carbon where
status = xx (give a specific value). This example will filter on the leftmost
column of the table (to check the index's effectiveness).

For that query there would be no partition pruning on the Hive ORC side, so Carbon should be faster in this case.
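
To actually run that check in spark-shell, something like this should do (a rough sketch; the table name comes from the mail below, and 17 is just a sample status value):

// filter only on status, the leftmost column, with a concrete value,
// to exercise the CarbonData index rather than a full scan
val df = sqlContext.sql("select * from test_carbon where status = 17")
df.show()
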
3. How many machines (nodes) did you use? Because one executor generates
one index B+ tree, to fully utilize the index, please try to reduce the
number of executors. Suggestion: launch one executor per machine/node (and
increase the executor's memory).

Yes, for easier debugging and to avoid CPU contention, I use one executor
with one core on each machine,
but the query run times are still slower than with ORC SQL.
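
For reference, a variant of my spark-shell launch (quoted below) that would follow suggestion 3 — still one executor per node, but with more memory and, for parallelism, more cores per executor — might look like this (the values are only illustrative):

 $SPARK_HOME/bin/spark-shell --verbose --name "test" --master yarn-client \
   --driver-memory 10G --executor-memory 24G --num-executors 40 --executor-cores 4
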


thanks
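
PS: to break down where the wall-clock time goes compared with what the Spark UI shows, here is a rough sketch extending my timing code (Spark 1.6 spark-shell; the query string below is only a placeholder):

val query = "select count(1) as total, status, d_id from test_carbon where status != 17 group by status, d_id order by total desc"
val t0 = System.nanoTime()
val df = sqlContext.sql(query)   // builds and analyzes the DataFrame; no Spark job runs yet
val planMs = (System.nanoTime() - t0) / 1000 / 1000
val t1 = System.nanoTime()
val rows = df.collect()          // runs the Spark jobs and fetches all result rows to the driver
val execMs = (System.nanoTime() - t1) / 1000 / 1000
println(s"analyze: $planMs ms, execute+collect: $execMs ms, rows: ${rows.length}")

Comparing execMs with the job time in the UI then shows roughly how much time is spent on the driver side (scheduling, result fetching) rather than in the executors.
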

2017-01-02 11:24 GMT+08:00 Liang Chen <[hidden email]>:

> Hi
>
> Thanks for starting to try the Apache CarbonData project.
>
> There may be various reasons for the test result. I assume that you
> made a time-based partition for the ORC data, right?
> 1. Can you tell how many rows of data the SQL returned?
>
> 2. You can try more SQL queries, for example: select * from test_carbon where
> status = xx (give a specific value). This example will filter on the leftmost
> column of the table (to check the index's effectiveness).
>
> 3. How many machines (nodes) did you use? Because one executor generates
> one index B+ tree, to fully utilize the index, please try to reduce the
> number of executors. Suggestion: launch one executor per machine/node (and
> increase the executor's memory).
>
> Regards
> Liang
>
>
> geda wrote
> > Hello:
> > I tested the same data with the same SQL in two formats: 1. CarbonData, 2. Hive
> > ORC,
> > but the Carbon format runs slower than ORC.
> > I use CarbonData with the index order as in the CREATE TABLE statement below.
> > Hive SQL (dt is the partition directory):
> > select count(1) as total ,status,d_id from test_orc where status !=17 and
> > v_id  in ( 91532,91533,91534,91535,91536,91537,10001 )  and   dt >=
> > '2016-11-01'  and  dt <= '2016-12-26' group by status,d_id order by total
> > desc
> > Carbon SQL (create_time is a timestamp type):
> >
> > select count(1) as total, status, d_id from test_carbon where status != 17
> > and v_id in ( 91532,91533,91534,91535,91536,91537,10001 ) and
> > date(a.create_time) >= '2016-11-01' and date(a.create_time) <= '2016-12-26'
> > group by status, d_id order by total desc
> >
> > The CarbonData table is created like this:
> > CREATE TABLE test_carbon ( status int, v_id bigint, d_id bigint,
> > create_time timestamp
> > ...
> > ...
> > 'DICTIONARY_INCLUDE'='status,d_id,v_id,create_time')
> >
> > Run with spark-shell on 40 nodes, Spark 1.6.1, Carbon 0.2.0, Hadoop 2.6.3.
> > The data is roughly:
> > 2 months (60 days), about 300,000 rows per day, 600MB of CSV per day.
> >  $SPARK_HOME/bin/spark-shell --verbose --name "test"   --master
> > yarn-client  --driver-memory 10G   --executor-memory 16G --num-executors
> > 40 --executor-cores 1
> >  I tested many cases:
> >  1. GC tuning: no full GC occurs.
> >  2. spark.sql.shuffle.partitions: all tasks run at the same time.
> >  3. carbon.conf settings:
> > enable.blocklet.distribution=true
> >
> > I use this code to measure the SQL run time:
> > val start = System.nanoTime()
> >   body
> >   (System.nanoTime() - start) / 1000 / 1000
> >
> > where body is sqlContext.sql(sql).show().
> > I find that ORC returns results faster than Carbon.
> >
> > Looking at the UI, sometimes the Carbon and ORC jobs take more or less the same time (I
> > think Carbon, using its index, should be faster, unless a sequential scan is faster
> > than an index scan), but ORC is more stable.
> > The UI shows about 5s spent, but the return time is 8s for ORC and 12s for Carbon.
> > (I don't know how to determine where the time is spent.)
> >
> > Here are some pictures from my runs (run many times).
> > Carbon runs:
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run1.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run2.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run1.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run2.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-run2.png>
> > ORC runs:
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-job-total-run1.png>
> > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-total-run1.png>
> >
> >
> > So my questions are:
> > 1. Why, in spark-shell with sql.show(), does the ORC SQL return faster than Carbon?
> > 2. In the Spark UI, Carbon should use its index to skip more data; the data scan
> > sometimes takes 4s, 2s, or 0.2s. How can I make the slowest tasks faster?
> > 3. As in this SQL, I filter on the leftmost index column, so I think it should run
> > faster than the ORC test in this case, but it does not. Why?
> > 4. If the answer to question 3 is that my data is too small, does that mean a
> > sequential read is faster than an index scan?
> >
> > Sorry for my poor English. Thanks for any help!
>