Apache CarbonData Dev Mailing List archive

Re: carbondata test join question

Classic

List

Threaded

2 messages Options

geda

Dec 15, 2016; 2:23am

Re: carbondata test join question

hi,thanks

like table a has 200+filelds , this sql use id ,v_id are is positin 3,4,should i put it in the first or second

id or v_id cardinality id 3w~7w ,

2016-12-15 9:29 GMT+08:00 杰 [via Apache CarbonData Mailing List archive] <[hidden email]>:

hi, geda
can u share ur create ddl?
some suggestion: for that filter field (like id), u can try to put in left column and use dictionary_include or exclude to make it dimension. if cardinality more than 100 thousand, u can try to make it no dictionary.
as for ur question, if all the dimensions make no dictionary, there will be no decode part.

thanks
Jay

------------------ Original ------------------
From: "geda";<[hidden email]>;
Date: Wed, Dec 14, 2016 11:10 PM
To: "dev"<[hidden email]>;

Subject: carbondata test join question

hello
i want to test orc ,carbon , which is faster
test on yarn with 8 executors, each executor 4G,each 2 core
spark1.6 ,carbon2.0
sql : join 3 table
A :300W row,5GB
B:13W row,30MB
C:7W row,10MB
like this:
select b.id ,b.d_name ,a.v_no,count(*) o_count from a left join b on
a.d_id=b.id left join c on c.v_no=a.v_no where date(a.create_time)>=
'2016-07-01' and date(a.create_time)<= '2016-09-02' group by b.id ,
b.d_name, a.v_no having o_count> 30 order by b.id desc
use context
cc.sql($SQL).show() : carbondata :run 5times avg time :7.3s
hiveContext.sql($SQL).show() : ORC : run 5times avg time:5.3s
i find from DAG ,carbon has a more job ,do carbon decode ,finish this job
cause 2-3s spend
if strip this job ,carbon and orc use time more or less the same
i want to know how to strip the last stage or how to tune sql like this
.Thanks
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/jobtrace.png>
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-carbon.png>
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n4440/laststage-orc.png>

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-test-join-question-tp4440.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

If you reply to this email, your message will be added to the discussion below:
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/carbondata-test-join-question-tp4440p4442.html

To unsubscribe from carbondata test join question, click here.
NAML

... [show rest of quote]

杰

Dec 15, 2016; 6:32am

回复： carbondata test join question

hi, genda

yes. suggest to move to the first or second.
the basic principle is, should from left to right with cardinality increasing,
at the same time, if it is a commonly used filter field, u should make it dimension and put left .

thanks
Jay

------------------ 原始邮件 ------------------
发件人: "geda";<[hidden email]>;
发送时间: 2016年12月15日(星期四) 上午10:23
收件人: "dev"<[hidden email]>;

主题: Re: carbondata test join question

hi,thanks
like table a has 200+filelds , this sql use id ,v_id are is positin
3,4,should i put it in the first or second
id or v_id cardinality id 3w~7w ,

2016-12-15 9:29 GMT+08:00 杰 [via Apache CarbonData Mailing List archive] <
[hidden email]>:

> hi, geda
> can u share ur create ddl?
> some suggestion: for that filter field (like id), u can try to put in
> left column and use dictionary_include or exclude to make it dimension. if
> cardinality more than 100 thousand, u can try to make it no dictionary.
> as for ur question, if all the dimensions make no dictionary, there will
> be no decode part.
>
>
> thanks
> Jay
>
>
>
>
>
> ------------------ Original ------------------
> From: "geda";<[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=4442&i=0>>;
> Date: Wed, Dec 14, 2016 11:10 PM
> To: "dev"<[hidden email]
> <http:///user/SendEmail.jtp?type=node&node=4442&i=1>>;
>
> Subject: carbondata test join question
>
>
>
> hello
> i want to test orc ,carbon , which is faster
> test on yarn with 8 executors, each executor 4G,each 2 core
> spark1.6 ,carbon2.0
> sql : join 3 table
> A :300W row,5GB
> B:13W row,30MB
> C:7W row,10MB
> like this:
> select b.id ,b.d_name ,a.v_no,count(*) o_count from a left join b on
> a.d_id=b.id left join c on c.v_no=a.v_no where
> date(a.create_time)>=
> '2016-07-01' and date(a.create_time)<= '2016-09-02' group by b.id ,
> b.d_name, a.v_no having o_count> 30 order by b.id desc
> use context
> cc.sql($SQL).show() : carbondata :run 5times avg time :7.3s
> hiveContext.sql($SQL).show() : ORC : run 5times avg time:5.3s
> i find from DAG ,carbon has a more job ,do carbon decode ,finish this job
> cause 2-3s spend
> if strip this job ,carbon and orc use time more or less the same
> i want to know how to strip the last stage or how to tune sql like this
> .Thanks
> <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/file/n4440/jobtrace.png>
> <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/file/n4440/laststage-carbon.png>
> <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/file/n4440/laststage-orc.png>
>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/carbondata-test-
> join-question-tp4440.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
> ------------------------------
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/carbondata-test-join-question-tp4440p4442.html
> To unsubscribe from carbondata test join question, click here
> <
> .
> NAML
> <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>

... [show rest of quote]

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Re-carbondata-test-join-question-tp4447.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.