carbondata test join question

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

carbondata test join question

geda
hello
i  want to test orc ,carbon , which is faster
test on yarn with 8 executors, each executor 4G,each 2 core
spark1.6 ,carbon2.0
sql : join  3 table
A :300W row,5GB
B:13W row,30MB
C:7W row,10MB
like this:
select b.id ,b.d_name ,a.v_no,count(*) o_count  from    a left join  b on a.d_id=b.id left join    c on  c.v_no=a.v_no  where  date(a.create_time)>= '2016-07-01' and  date(a.create_time)<= '2016-09-02' group by  b.id , b.d_name, a.v_no  having  o_count> 30 order by b.id desc
use context
cc.sql($SQL).show() : carbondata :run 5times  avg time :7.3s
hiveContext.sql($SQL).show() : ORC : run 5times  avg time:5.3s
i find from DAG ,carbon has a more job ,do carbon decode ,finish this job cause 2-3s spend
if strip this job ,carbon and orc  use time more or less the same
i want to know how to strip  the last stage or how to tune sql like this .Thanks





Reply | Threaded
Open this post in threaded view
|

Re: carbondata test join question

杰
hi, geda
 can u share ur create ddl?
 some suggestion: for that filter field (like id), u can try to put in left column and use dictionary_include or exclude to make it dimension. if  cardinality more than 100 thousand, u can try to make it no dictionary.
 as for ur question, if all the dimensions make no dictionary, there will be no decode part.


thanks
Jay





------------------ Original ------------------
From:  "geda";<[hidden email]>;
Date:  Wed, Dec 14, 2016 11:10 PM
To:  "dev"<[hidden email]>;

Subject:  carbondata test join  question



hello
i  want to test orc ,carbon , which is faster
test on yarn with 8 executors, each executor 4G,each 2 core
spark1.6 ,carbon2.0
sql : join  3 table
A :300W row,5GB
B:13W row,30MB
C:7W row,10MB
like this:
select b.id ,b.d_name ,a.v_no,count(*) o_count  from    a left join  b on
a.d_id=b.id left join    c on  c.v_no=a.v_no  where  date(a.create_time)>=
'2016-07-01' and  date(a.create_time)<= '2016-09-02' group by  b.id ,
b.d_name, a.v_no  having  o_count> 30 order by b.id desc
use context
cc.sql($SQL).show() : carbondata :run 5times  avg time :7.3s
hiveContext.sql($SQL).show() : ORC : run 5times  avg time:5.3s
i find from DAG ,carbon has a more job ,do carbon decode ,finish this job
cause 2-3s spend
if strip this job ,carbon and orc  use time more or less the same
i want to know how to strip  the last stage or how to tune sql like this
.Thanks
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/jobtrace.png>
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-carbon.png>
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-orc.png>






--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-test-join-question-tp4440.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
Reply | Threaded
Open this post in threaded view
|

Re: carbondata test join question

Liang Chen
Administrator
In reply to this post by geda
Hi geda

As we know, CarbonData's key feature is index.  
About tuning SQL, you can refer to  : https://cwiki.apache.org/confluence/display/CARBONDATA/FAQ

Regards
Liang