hi,thanks like table a has 200+filelds , this sql use id ,v_id are is positin 3,4,should i put it in the first or second id or v_id cardinality id 3w~7w , 2016-12-15 9:29 GMT+08:00 杰 [via Apache CarbonData Mailing List archive] <[hidden email]>: hi, geda |
hi, genda
yes. suggest to move to the first or second. the basic principle is, should from left to right with cardinality increasing, at the same time, if it is a commonly used filter field, u should make it dimension and put left . thanks Jay ------------------ 原始邮件 ------------------ 发件人: "geda";<[hidden email]>; 发送时间: 2016年12月15日(星期四) 上午10:23 收件人: "dev"<[hidden email]>; 主题: Re: carbondata test join question hi,thanks like table a has 200+filelds , this sql use id ,v_id are is positin 3,4,should i put it in the first or second id or v_id cardinality id 3w~7w , 2016-12-15 9:29 GMT+08:00 杰 [via Apache CarbonData Mailing List archive] < [hidden email]>: > hi, geda > can u share ur create ddl? > some suggestion: for that filter field (like id), u can try to put in > left column and use dictionary_include or exclude to make it dimension. if > cardinality more than 100 thousand, u can try to make it no dictionary. > as for ur question, if all the dimensions make no dictionary, there will > be no decode part. > > > thanks > Jay > > > > > > ------------------ Original ------------------ > From: "geda";<[hidden email] > <http:///user/SendEmail.jtp?type=node&node=4442&i=0>>; > Date: Wed, Dec 14, 2016 11:10 PM > To: "dev"<[hidden email] > <http:///user/SendEmail.jtp?type=node&node=4442&i=1>>; > > Subject: carbondata test join question > > > > hello > i want to test orc ,carbon , which is faster > test on yarn with 8 executors, each executor 4G,each 2 core > spark1.6 ,carbon2.0 > sql : join 3 table > A :300W row,5GB > B:13W row,30MB > C:7W row,10MB > like this: > select b.id ,b.d_name ,a.v_no,count(*) o_count from a left join b on > a.d_id=b.id left join c on c.v_no=a.v_no where > date(a.create_time)>= > '2016-07-01' and date(a.create_time)<= '2016-09-02' group by b.id , > b.d_name, a.v_no having o_count> 30 order by b.id desc > use context > cc.sql($SQL).show() : carbondata :run 5times avg time :7.3s > hiveContext.sql($SQL).show() : ORC : run 5times avg time:5.3s > i find from DAG ,carbon has a more job ,do carbon decode ,finish this job > cause 2-3s spend > if strip this job ,carbon and orc use time more or less the same > i want to know how to strip the last stage or how to tune sql like this > .Thanks > <http://apache-carbondata-mailing-list-archive.1130556. > n5.nabble.com/file/n4440/jobtrace.png> > <http://apache-carbondata-mailing-list-archive.1130556. > n5.nabble.com/file/n4440/laststage-carbon.png> > <http://apache-carbondata-mailing-list-archive.1130556. > n5.nabble.com/file/n4440/laststage-orc.png> > > > > > > > -- > View this message in context: http://apache-carbondata- > mailing-list-archive.1130556.n5.nabble.com/carbondata-test- > join-question-tp4440.html > Sent from the Apache CarbonData Mailing List archive mailing list archive > at Nabble.com. > > ------------------------------ > If you reply to this email, your message will be added to the discussion > below: > http://apache-carbondata-mailing-list-archive.1130556. > n5.nabble.com/carbondata-test-join-question-tp4440p4442.html > To unsubscribe from carbondata test join question, click here > < > . > NAML > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> > -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Re-carbondata-test-join-question-tp4447.html Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com. |
Free forum by Nabble | Edit this page |