Apache CarbonData Dev Mailing List archive

回复： carbondata test join question

Posted by 杰 on Dec 15, 2016; 6:32am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Re-carbondata-test-join-question-tp4447p4459.html

hi, genda

yes. suggest to move to the first or second.
the basic principle is, should from left to right with cardinality increasing,
at the same time, if it is a commonly used filter field, u should make it dimension and put left .

thanks
Jay

------------------ 原始邮件 ------------------
发件人: "geda";<[hidden email]>;
发送时间: 2016年12月15日(星期四) 上午10:23
收件人: "dev"<[hidden email]>;

主题: Re: carbondata test join question

hi,thanks
like table a has 200+filelds , this sql use id ,v_id are is positin
3,4,should i put it in the first or second
id or v_id cardinality id 3w~7w ,

2016-12-15 9:29 GMT+08:00 杰 [via Apache CarbonData Mailing List archive] <
[hidden email]>:

> hi, geda
> can u share ur create ddl?
> some suggestion: for that filter field (like id), u can try to put in
> left column and use dictionary_include or exclude to make it dimension. if
> cardinality more than 100 thousand, u can try to make it no dictionary.
> as for ur question, if all the dimensions make no dictionary, there will
> be no decode part.
>
>
> thanks
> Jay
>
>
>
>
>
> ------------------ Original ------------------
> From: "geda";<[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=4442&i=0>>;
> Date: Wed, Dec 14, 2016 11:10 PM
> To: "dev"<[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=4442&i=1>>;
>
> Subject: carbondata test join question
>
>
>
> hello
> i want to test orc ,carbon , which is faster
> test on yarn with 8 executors, each executor 4G,each 2 core
> spark1.6 ,carbon2.0
> sql : join 3 table
> A :300W row,5GB
> B:13W row,30MB
> C:7W row,10MB
> like this:
> select b.id ,b.d_name ,a.v_no,count(*) o_count from a left join b on
> a.d_id=b.id left join c on c.v_no=a.v_no where
> date(a.create_time)>=
> '2016-07-01' and date(a.create_time)<= '2016-09-02' group by b.id ,
> b.d_name, a.v_no having o_count> 30 order by b.id desc
> use context
> cc.sql($SQL).show() : carbondata :run 5times avg time :7.3s
> hiveContext.sql($SQL).show() : ORC : run 5times avg time:5.3s
> i find from DAG ,carbon has a more job ,do carbon decode ,finish this job
> cause 2-3s spend
> if strip this job ,carbon and orc use time more or less the same
> i want to know how to strip the last stage or how to tune sql like this
> .Thanks
> <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/file/n4440/jobtrace.png>
> <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/file/n4440/laststage-carbon.png>
> <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/file/n4440/laststage-orc.png>
>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/carbondata-test-
> join-question-tp4440.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
> ------------------------------
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/carbondata-test-join-question-tp4440p4442.html
> To unsubscribe from carbondata test join question, click here
> <
> .
> NAML
> <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Re-carbondata-test-join-question-tp4447.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.