hello
i want to test orc ,carbon , which is faster test on yarn with 8 executors, each executor 4G,each 2 core spark1.6 ,carbon2.0 sql : join 3 table A :300W row,5GB B:13W row,30MB C:7W row,10MB like this: select b.id ,b.d_name ,a.v_no,count(*) o_count from a left join b on a.d_id=b.id left join c on c.v_no=a.v_no where date(a.create_time)>= '2016-07-01' and date(a.create_time)<= '2016-09-02' group by b.id , b.d_name, a.v_no having o_count> 30 order by b.id desc use context cc.sql($SQL).show() : carbondata :run 5times avg time :7.3s hiveContext.sql($SQL).show() : ORC : run 5times avg time:5.3s i find from DAG ,carbon has a more job ,do carbon decode ,finish this job cause 2-3s spend if strip this job ,carbon and orc use time more or less the same i want to know how to strip the last stage or how to tune sql like this .Thanks |
hi, geda
can u share ur create ddl? some suggestion: for that filter field (like id), u can try to put in left column and use dictionary_include or exclude to make it dimension. if cardinality more than 100 thousand, u can try to make it no dictionary. as for ur question, if all the dimensions make no dictionary, there will be no decode part. thanks Jay ------------------ Original ------------------ From: "geda";<[hidden email]>; Date: Wed, Dec 14, 2016 11:10 PM To: "dev"<[hidden email]>; Subject: carbondata test join question hello i want to test orc ,carbon , which is faster test on yarn with 8 executors, each executor 4G,each 2 core spark1.6 ,carbon2.0 sql : join 3 table A :300W row,5GB B:13W row,30MB C:7W row,10MB like this: select b.id ,b.d_name ,a.v_no,count(*) o_count from a left join b on a.d_id=b.id left join c on c.v_no=a.v_no where date(a.create_time)>= '2016-07-01' and date(a.create_time)<= '2016-09-02' group by b.id , b.d_name, a.v_no having o_count> 30 order by b.id desc use context cc.sql($SQL).show() : carbondata :run 5times avg time :7.3s hiveContext.sql($SQL).show() : ORC : run 5times avg time:5.3s i find from DAG ,carbon has a more job ,do carbon decode ,finish this job cause 2-3s spend if strip this job ,carbon and orc use time more or less the same i want to know how to strip the last stage or how to tune sql like this .Thanks <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/jobtrace.png> <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-carbon.png> <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-orc.png> -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-test-join-question-tp4440.html Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com. |
Administrator
|
In reply to this post by geda
Hi geda
As we know, CarbonData's key feature is index. About tuning SQL, you can refer to : https://cwiki.apache.org/confluence/display/CARBONDATA/FAQ Regards Liang |
Free forum by Nabble | Edit this page |