[DISCUSS] For the dimension default should be no dictionary


bill.zhou
Hi all,
    Currently, when creating a CarbonData table, any dimension that is not listed in the dictionary_exclude property is dictionary-encoded by default. I think the default should be no dictionary.

    For example, in a POC I did for one customer, the table has 300 columns, 200 of which are dimensions, but only 5 columns are used for filtering. The customer only needs to set those 5 columns as dictionary columns and leave the other 195 as no-dictionary. As things stand, they have to list all 195 columns in the dictionary_exclude property, which wastes time, makes the CREATE TABLE command huge, and also impacts load performance.

    So I suggest that dimensions default to no dictionary. This would also make it easy for customers to see which columns are dictionary columns, which is useful.
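
To make the two defaults concrete, here is a minimal Python sketch of how the set of dictionary columns could be derived under each one. The helper `dictionary_columns` and the column names `dim1` to `dim200` are hypothetical and only model the behaviour described in this post; this is not CarbonData code.

```python
def dictionary_columns(dimensions, include=(), exclude=(), default_dictionary=True):
    """Return the dimensions that end up dictionary-encoded.

    default_dictionary=True models the current behaviour (every dimension is
    dictionary-encoded unless excluded); False models the proposed default.
    Hypothetical helper for illustration only.
    """
    include, exclude = set(include), set(exclude)
    if default_dictionary:
        return [d for d in dimensions if d not in exclude]
    return [d for d in dimensions if d in include]


# 200 dimensions, of which only the first 5 are used in filters.
dimensions = [f"dim{i}" for i in range(1, 201)]
filter_columns = dimensions[:5]

# Current default: the other 195 dimensions must be listed explicitly.
must_exclude = [d for d in dimensions if d not in filter_columns]
current = dictionary_columns(dimensions, exclude=must_exclude, default_dictionary=True)

# Proposed default: only the 5 dictionary columns are listed.
proposed = dictionary_columns(dimensions, include=filter_columns, default_dictionary=False)

print(len(must_exclude), "columns to list today vs", len(filter_columns), "with the proposal")
print("same encoding result:", current == proposed)
```

Both calls give the same encoding; the only difference is how many column names the user has to type.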

Re: [DISCUSS] For the dimension default should be no dictionary

ravipesala
Hi,

I feel there are more disadvantages than advantages in this approach. In
your current scenario you want to set dictionary only for the columns that
are used as filters, but dictionary is not useful only for filters: it can
also reduce the store size and speed up aggregation queries.
I think you should set no_inverted_index false on non-filtered columns to
reduce the store size and improve performance.

If we make no-dictionary the default, users no longer need to list the
no-dictionary columns in the DDL, but they do need to list the dictionary
columns. If a user wants many dictionary columns, the same problem you
mentioned arises again, so this does not solve the problem. I feel we
should make our DDL more flexible to simplify these scenarios, and we
should discuss it further.

Pros & Cons of your suggestion.
Advantages :
1. Decoding/Encoding of dictionary could be avoided.

Disadvantages :
1. Store size will increase drastically.
2. IO will increase so query performance will come down.
3. Aggregation queries performance will suffer.
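
As a rough illustration of the store-size point, the sketch below compares the bytes needed for a column stored as plain strings with the bytes needed for a dictionary plus one fixed-width key per row. The size model (UTF-8 bytes, no length prefixes, no compression) and the sample data are assumptions made for this example; real CarbonData files will differ.

```python
import math


def raw_size(values):
    """Bytes needed to store every value as plain UTF-8 (no length prefixes)."""
    return sum(len(v.encode("utf-8")) for v in values)


def dictionary_size(values):
    """Bytes for the distinct values plus one fixed-width key per row."""
    if not values:
        return 0
    distinct = set(values)
    # Smallest whole number of bytes that can address every distinct value.
    key_bytes = max(1, math.ceil(math.log2(len(distinct)) / 8))
    return raw_size(distinct) + key_bytes * len(values)


# Low-cardinality column: 3 distinct values over 30,000 rows.
cities = ["Bangalore", "Shenzhen", "Beijing"] * 10_000
# High-cardinality column: every one of the 30,000 rows is distinct.
ids = [f"user-{i:08d}" for i in range(30_000)]

for name, column in (("cities", cities), ("ids", ids)):
    print(name, "raw:", raw_size(column), "dictionary:", dictionary_size(column))
```

In this toy model the low-cardinality column shrinks a lot with a dictionary, while the all-distinct column gets larger, which is why the choice depends on cardinality.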



Regards,
Ravindra.

On 26 February 2017 at 20:04, bill.zhou <[hidden email]> wrote:

> hi All
>     Now when create the CarbonData table,if  the dimension don't add into
> the dictionary_exclude properties, the dimension will be consider as
> dictionary default. I think default should be no dictionary.
>
>     For example when I do the POC for one customer, it has 300 columns and
> 200 dimensions, but only 5 columns is used for filter, so he only need set
> this 5 columns to dictionary and leave other 195 columns to no dictionary.
> But now he need specify for the 195 columns to dictionary_exclude
> properties
> the will waste time and make the create table command huge, also will
> impact
> the load performance.
>
>     So I suggestion dimension default should be no dictionary and this can
> also help customer easy to know the dictionary column which is useful.
>
>
>
> --
> View this message in context:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



--
Thanks & Regards,
Ravi

Re: [DISCUSS] For the dimension default should be no dictionary

kumarvishal09
Hi,
    I completely agree with Ravindra's points. A larger number of
no-dictionary columns will impact IO for both reading and writing, because
the data size increases without dictionary encoding. Late decoding is one
of the main advantages of dictionary encoding, and aggregation on
no-dictionary columns will be slower. Filter queries will also suffer: for
a dictionary column we compare on the byte-packed value, whereas for a
no-dictionary column the comparison is on the actual value.
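
A small Python sketch of the two effects mentioned here, filtering on encoded keys and late decoding for aggregation. It uses plain integers as surrogate keys as a stand-in for the byte-packed values; the function `build_dictionary` and the sample column are made up for illustration and are not CarbonData internals.

```python
from collections import Counter


def build_dictionary(values):
    """Map each distinct value to an integer surrogate key (sorted order)."""
    return {v: k for k, v in enumerate(sorted(set(values)))}


city = ["Shenzhen", "Bangalore", "Shenzhen", "Beijing", "Bangalore", "Shenzhen"]
dictionary = build_dictionary(city)
encoded = [dictionary[v] for v in city]  # what a dictionary column would store

# Filter on a dictionary column: translate the literal once,
# then compare small integers row by row.
target = dictionary["Shenzhen"]
matches_encoded = [i for i, key in enumerate(encoded) if key == target]

# Filter on a no-dictionary column: compare the actual string on every row.
matches_plain = [i for i, v in enumerate(city) if v == "Shenzhen"]

# Group by with late decoding: aggregate on the keys and decode only
# the (much smaller) result at the end.
reverse = {k: v for v, k in dictionary.items()}
counts = {reverse[k]: n for k, n in Counter(encoded).items()}

print(matches_encoded == matches_plain)
print(counts)
```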

-Regards
Kumar Vishal

On Mon, Feb 27, 2017 at 12:34 AM, Ravindra Pesala <[hidden email]>
wrote:

> Hi,
>
> I feel there are more disadvantages than advantages in this approach. In
> your current scenario you want to set dictionary only for columns which are
> used as filters, but the usage of dictionary is not only limited for
> filters, it can reduce the store size and improve the aggregation queries.
> I think you should set no_inverted_index false on non filtered columns to
> reduce the store size and improve the performance.
>
> If we make no dictionary as default then user no need set them in DDL but
> user needs to set the dictionary columns. If user wants to set more
> dictionary columns then the same problem what you mentioned arises again so
> it does not solve the problem. I feel we should give more flexibility in
> our DDL to simplify these scenarios and we should have more discussion on
> it.
>
> Pros & Cons of your suggestion.
> Advantages :
> 1. Decoding/Encoding of dictionary could be avoided.
>
> Disadvantages :
> 1. Store size will increase drastically.
> 2. IO will increase so query performance will come down.
> 3. Aggregation queries performance will suffer.
>
>
>
> Regards,
> Ravindra.

Re: [DISCUSS] For the dimension default should be no dictionary

bill.zhou
Dear Vishal & Ravindra,

  Thanks for your replies. I think I didn't describe the idea clearly, so let me explain it in full.
1. Dictionary is an important feature of CarbonData, and we introduce it to every new customer. New customers therefore understand it clearly and will set the dictionary columns when they create a table.
2. Bank, telecom, and traffic customers all share the same scenario: the table has many columns, but only a few are set as dictionary columns.
    For a telecom customer, out of 300 columns only 5 are set as dictionary; the other dimensions are not.
    For a bank customer, out of 100 columns only about 5 are set as dictionary; the other dimensions are not.
In customers' actual usage today, only the dimensions used in filters and group by are set as dictionary columns.
3. My suggestion is that no-dictionary becomes the default only for dimensions that are not listed in the dictionary_include property, not for all dimension columns. If a customer always adds the 5 columns they use to dictionary_include and leaves the other columns as no-dictionary, query performance is not affected.

So I suggest that dimension columns not added to the dictionary_include property default to no dictionary.

Regards
Bill



Re: [DISCUSS] For the dimension default should be no dictionary

ravipesala
Hi Bill,

I got your point, but making no-dictionary the default may not be the
perfect solution. Basically, no-dictionary columns are meant only for
high-cardinality dimensions, so the usage may change from user to user or
from scenario to scenario.
This is fundamentally a DDL usability issue, so please focus first on
simplifying the DDL.

For example, suppose we have 6 columns. We could write the DDL as below.
Case 1:
SORT_COLUMNS="C1,C2,C3"
NON_SORT_COLUMNS="C4,C5,C6"
In the above case C1, C2, and C3 are sort columns and part of the MDK, and
C4, C5, and C6 become non-sort columns (measure/complex).

DICTIONARY_EXCLUDE='ALL'
DICTIONARY_INCLUDE='C3'
In the above case all sort columns (C1, C2, C3) are no-dictionary columns
except C3, which is a dictionary column.

Case 2:
SORT_COLUMNS="ALL"
NON_SORT_COLUMNS="C6"
In this case all columns are sort columns except C6.

DICTIONARY_EXCLUDE='C2'
DICTIONARY_INCLUDE='ALL'
In the above case all sort columns (C1, C2, C3, C4, C5) are dictionary
columns except C2, which is a no-dictionary column.

The above is just my idea of how to simplify the DDL to handle all
scenarios. We can discuss further how to simplify it.
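
One possible reading of the 'ALL' semantics in the two cases, sketched in Python. The function `resolve_dictionary`, its argument shapes, and the rule that both sides cannot be 'ALL' at once are assumptions made for this illustration; none of this is an existing CarbonData API.

```python
def resolve_dictionary(sort_columns, include, exclude):
    """Split sort columns into (dictionary, no_dictionary) lists.

    `include` and `exclude` are each either the string "ALL" or a list of
    column names. When one side is "ALL", the other side lists the
    exceptions. Hypothetical interpretation of the proposal.
    """
    if include == "ALL" and exclude == "ALL":
        raise ValueError("DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE cannot both be 'ALL'")
    if exclude == "ALL":
        # Everything is no-dictionary except the explicitly included columns.
        dictionary = [c for c in sort_columns if c in include]
    elif include == "ALL":
        # Everything is dictionary except the explicitly excluded columns.
        dictionary = [c for c in sort_columns if c not in exclude]
    else:
        # Neither side is "ALL": unlisted columns stay dictionary-encoded,
        # matching the default described at the start of this thread.
        dictionary = [c for c in sort_columns if c not in exclude]
    no_dictionary = [c for c in sort_columns if c not in dictionary]
    return dictionary, no_dictionary


# Case 1: DICTIONARY_EXCLUDE='ALL', DICTIONARY_INCLUDE='C3'
case1 = resolve_dictionary(["C1", "C2", "C3"], include=["C3"], exclude="ALL")

# Case 2: DICTIONARY_EXCLUDE='C2', DICTIONARY_INCLUDE='ALL'
case2 = resolve_dictionary(["C1", "C2", "C3", "C4", "C5"], include="ALL", exclude=["C2"])

print(case1)
print(case2)
```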

Regards,
Ravindra.

On 27 February 2017 at 12:38, bill.zhou <[hidden email]> wrote:

> Dear Vishal & Ravindra
>
>   Thanks for you reply,  I think I didn't describe it clearly so that you
> don't get full idea.
> 1. dictionary is important feature in CarbonData, for every new customer we
> will introduce this feature to him. So for new customer will know it
> clearly, will set the dictionary column when create table.
> 2. For all customer like bank customer, telecom customer and traffic
> customer have a same scenario is: have more column but only set few column
> as dictionary.
>     like telecom customer, 300 column only set 5 column dictionary, other
> dim don't set dictionary.
>     like bank customer, 100 column only set about 5 column dictionary,
> other
> dim don't set dictionary.
> *For currently customer actually user scenario, they only set the dim which
> used for filter and group by related column as dictionary*
> 3. mys suggestion is that: dim column default as no dictionary is only for
> the dim which not put into the dictionary_include properties, not for all
> dim column. If customer always used 5 columns add into dictionary_include
> and others column no dictionary, this will not impact the query
> performance.
>
> So that I suggestion the dim column default set as no dictionary which not
> added in to dictionary_include properties.
>
> Regards
> Bill



--
Thanks & Regards,
Ravi

Re: [DISCUSS] For the dimension default should be no dictionary

Liang Chen
Hi

+1 for adding DICTIONARY_EXCLUDE='ALL' and DICTIONARY_INCLUDE='ALL' to
improve the usability of the DDL.
This solution is more flexible than making no-dictionary the default.

Regards
Liang

2017-02-27 20:27 GMT+08:00 Ravindra Pesala <[hidden email]>:

> Hi Bill,
>
> I got your point, but the solution of making no-dictionary as default may
> not be perfect solution. Basically no-dictionary columns are only meant for
> high cardinality dimensions, so the usage may change from user to user or
> scenario to scenario .
> This is the basic issue of usability of DDL, please first focus on to
> simplify DDL usability.
>
> For example we have 6 columns , we can mention DDL as below.
> case 1 :
> SORT_COLUMNS="C1,C2,C3"
> NON_SORT_COLUMNS="C4,C5,C6"
> In above case C1, C2 , C3 are sort columns and part of MDK key. And
> C4,C5,C6 are become non sort columns(measure/complex)
>
> DICTIONARY_EXCLUDE= 'ALL'
> DICTIONARY_INCLUDE='C3'
> In the above case all sort columns((C1,C2,C3) are non-dictionary columns
> except C3, here C3 is dictionary column.
>
> case 2:
> SORT_COLUMNS="ALL"
> NON_SORT_COLUMNS="C6"
> In this case all columns are sort columns except C6.
>
> DICTIONARY_EXCLUDE= 'C2'
> DICTIONARY_INCLUDE='ALL'
> In the above case all sort columns(C1,C2,C3,C4,C5) are dictionary columns
> except C2, here C2 is no-dictionary column.
>
> Above mentioned are just my idea of how to simplify DDL to handle all
> scenarios. We can have more discussion towards it to simplify the DDL.
>
> Regards,
> Ravindra.



--
Regards
Liang

Re: [DISCUSS] For the dimension default should be no dictionary

Jacky Li
In reply to this post by ravipesala
Yes, first we should simplify the DDL options. I propose the following options; please check whether they miss any scenario.

1. SORT_COLUMNS, or SORT_KEY
This indicates three things:
1) All columns specified in the option are used to construct the Multi-Dimensional Key (MDK), and the data is sorted along this key
2) They are encoded with an inverted index and thus sorted again within the column chunk in one blocklet
3) A minmax index is also created for these columns

When to use: This option is designed to accelerate filter queries, so put all filter columns into it. The column order can be:
1) From low cardinality to high cardinality. This gives the best compression and fits scenarios without frequent filters on high-cardinality columns
2) High-cardinality columns first, then the others. This fits frequent filters on high-cardinality columns

For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and are encoded with an inverted index and a minmax index.
Note that C1, C2, C3 can be dimensions, but they can also be measures. So if a user needs to filter on a measure column, it can be put in the SORT_COLUMNS option.

If this option is not specified by the user, carbon picks the MDK as it does now.
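
To show why sorting along the key helps filters, here is a toy Python sketch of sorting rows by a multi-column key and then using per-blocklet min/max values to skip blocklets. The blocklet size of 4, the function names, and the sample rows are assumptions for illustration; the inverted index is not modelled.

```python
def sort_by_mdk(rows, sort_columns):
    """Sort rows (dicts) by the tuple of sort-column values, i.e. the key."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))


def blocklet_minmax(rows, column, blocklet_size):
    """Return (min, max) of `column` for each consecutive group of rows."""
    result = []
    for start in range(0, len(rows), blocklet_size):
        values = [r[column] for r in rows[start:start + blocklet_size]]
        result.append((min(values), max(values)))
    return result


def blocklets_to_scan(minmax, value):
    """Indexes of blocklets whose [min, max] range can contain `value`."""
    return [i for i, (lo, hi) in enumerate(minmax) if lo <= value <= hi]


pairs = [(3, "x"), (1, "y"), (2, "x"), (1, "x"), (3, "y"), (2, "y"), (1, "z"), (3, "z")]
rows = [{"C1": c1, "C2": c2} for c1, c2 in pairs]

unsorted_minmax = blocklet_minmax(rows, "C1", blocklet_size=4)
sorted_rows = sort_by_mdk(rows, ["C1", "C2"])
sorted_minmax = blocklet_minmax(sorted_rows, "C1", blocklet_size=4)

# Filter C1 = 3: how many blocklets must be read in each layout?
print(blocklets_to_scan(unsorted_minmax, 3))
print(blocklets_to_scan(sorted_minmax, 3))
```

With the rows in arrival order every blocklet has to be read; after sorting on the key, the min/max ranges let the filter skip the first blocklet.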

2. TABLE_DICTIONARY
This specifies the table-level dictionary columns. A global dictionary is created for all columns in this option on every data load.

When to use: This option is designed to accelerate aggregate queries, so put group by columns into it.

For example, TABLE_DICTIONARY="C2,C3,C5"

If this option is not specified by the user, all columns are encoded without global dictionary support, and a normal shuffle on decoded values is applied for group by operations.
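
A minimal Python sketch of the idea behind a table-level dictionary: keys stay stable across data loads, so a group by can aggregate on keys from different loads and decode only the final result. The class `GlobalDictionary` and its methods are hypothetical and do not reflect how CarbonData actually stores or generates its dictionary.

```python
from collections import Counter


class GlobalDictionary:
    """Table-level dictionary for one column; keys are stable across loads."""

    def __init__(self):
        self.key_of = {}    # value -> surrogate key
        self.value_of = []  # surrogate key -> value

    def encode_load(self, values):
        """Encode one data load, extending the dictionary with new values."""
        keys = []
        for v in values:
            if v not in self.key_of:
                self.key_of[v] = len(self.value_of)
                self.value_of.append(v)
            keys.append(self.key_of[v])
        return keys


country = GlobalDictionary()
load1 = country.encode_load(["cn", "in", "cn"])
load2 = country.encode_load(["in", "us", "cn"])

# Group by across both loads on the keys, then decode the small result.
key_counts = Counter(load1 + load2)
result = {country.value_of[k]: n for k, n in key_counts.items()}
print(load1, load2, result)
```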

I think these two options should be the basic options for normal users; their goal is to satisfy most scenarios without deep tuning of the table.
For advanced users who want to do deep tuning, we can debate adding more options. But first we need to identify which scenarios are not satisfied by these two options.

Regards,
Jacky

Re: [DISCUSS] For the dimension default should be no dictionary

Liang Chen
Hi

A couple of questions:

1) For the SORT_KEY option: are the MDK index, inverted index, and minmax
index built only for the columns specified in the option (SORT_KEY)?

2) If users don't specify TABLE_DICTIONARY, then no column is
dictionary-encoded and all shuffle operations are based on the actual
values. Is my understanding right?
-------------------------------------------------------------------------------------------------------
If this option is not specified by user, means all columns encoding without
global dictionary support. Normal shuffle on decoded value will be applied
when doing group by operation.

3) With the two options SORT_KEY and TABLE_DICTIONARY introduced, suppose
C2 is specified in SORT_KEY but not in TABLE_DICTIONARY. How does the
system handle this case?
-----------------------------------------------------------------------------------------------------------
For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as
Inverted Index and with Minmax Index

Regards
Liang

2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:

> Yes, first we should simplify the DDL options. I propose following options,
> please check weather it miss some scenario.
>
> 1. SORT_COLUMNS, or SORT_KEY
> This indicates three things:
> 1) All columns specified in options will be used to construct
> Multi-Dimensional Key, which will be sorted along this key
> 2) They will be encoded as Inverted Index and thus again sorted within
> column chunk in one blocklet
> 3) Minmax index will also be created for these columns
>
> When to use: This option is designed for accelerating filter query, so put
> all filter columns into this option. The order of it can be:
> 1) From low cardinality to high cardinality, this will make most
> compression
> and fit for scenario that does not have frequent filter on high card column
> 2) Put high cardinality column first, then put others. This fits for
> frequent filter on high card column
>
> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as
> Inverted Index and with Minmax Index
> Note that while C1,C2,C3 can be dimension but they also can be measure. So
> if user need to filter on measure column, it can be put in SORT_COLUMNS
> option.
>
> If this option is not specified by user, carbon will pick MDK as it is now.
>
> 2. TABLE_DICTIONARY
> This is to specify the table level dictionary columns. Will create global
> dictionary for all columns in this option for every data load.
>
> When to use: The option is designed for accelerating aggregate query, so
> put
> group by columns into this option
>
> For example. TABLE_DICTIONARY=“C2,C3,C5”
>
> If this option is not specified by user, means all columns encoding without
> global dictionary support. Normal shuffle on decoded value will be applied
> when doing group by operation.
>
> I think these two options should be the basic option for normal user, the
> goal of them is to satisfy the most scenario without deep tuning of the
> table
> For advanced user who want to do deep tuning, we can debate to add more
> options. But we need to identify what scenario is not satisfied by using
> these two options first.
>
> Regards,
> Jacky



--
Regards
Liang

Re: [DISCUSS] For the dimension default should be no dictionary

bill.zhou
In reply to this post by ravipesala
Hi Ravindra,

It is a good idea to consider the sort columns and dictionary columns together.
For DDL usability I have the following suggestions. Please share your thoughts.
1. The sort column properties should follow the same style as the dictionary properties,
   so I suggest changing the keywords to SORT_INCLUDE and SORT_EXCLUDE.

2. Users may be confused if DICTIONARY_EXCLUDE='ALL' and DICTIONARY_INCLUDE='C3' appear together.

3. The values of the sort and dictionary properties should only be column names.
   If DICTIONARY_EXCLUDE='ALL' is allowed, 'ALL' may conflict with an actual table column name.

So I think the key point is how to handle the default for columns that are not listed in INCLUDE or EXCLUDE, because if an end user puts a column in INCLUDE or EXCLUDE, that column is important to them.

So my suggestion is to add one more property called xxx_DEFAULT.
For example, suppose we have 6 columns. We could write the DDL as below.
Case 1:
SORT_INCLUDE="C1,C2,C3"
SORT_EXCLUDE="C4,C5,C6"
In the above case C1, C2, and C3 are sort columns and part of the MDK, and
C4, C5, and C6 become non-sort columns (measure/complex).

DICTIONARY_DEFAULT='EXCLUDE'
DICTIONARY_INCLUDE='C3'
In the above case all sort columns (C1, C2, C3) are no-dictionary columns
except C3, which is a dictionary column.

Case 2:
SORT_DEFAULT="INCLUDE"
SORT_EXCLUDE="C6"
In this case all columns are sort columns except C6.

DICTIONARY_EXCLUDE='C2'
DICTIONARY_DEFAULT='INCLUDE'
In the above case all sort columns (C1, C2, C3, C4, C5) are dictionary
columns except C2, which is a no-dictionary column.
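
The xxx_DEFAULT idea can be sketched in Python as one resolver shared by the sort and dictionary properties. The function `resolve`, the spelling 'EXCLUDE', the default value used when xxx_DEFAULT is omitted, and the error on a column listed on both sides are assumptions for this illustration, not an agreed design.

```python
def resolve(columns, include=(), exclude=(), default="INCLUDE"):
    """Return the columns for which a feature (sort or dictionary) is on.

    `default` plays the role of xxx_DEFAULT: it decides what happens to
    columns that appear in neither list. Hypothetical helper.
    """
    if default not in ("INCLUDE", "EXCLUDE"):
        raise ValueError("default must be 'INCLUDE' or 'EXCLUDE'")
    include, exclude = set(include), set(exclude)
    both = include & exclude
    if both:
        raise ValueError(f"columns listed in both INCLUDE and EXCLUDE: {sorted(both)}")
    return [
        c for c in columns
        if c in include or (c not in exclude and default == "INCLUDE")
    ]


columns = ["C1", "C2", "C3", "C4", "C5", "C6"]

# Case 1
sort1 = resolve(columns, include=["C1", "C2", "C3"], exclude=["C4", "C5", "C6"])
dict1 = resolve(sort1, include=["C3"], default="EXCLUDE")

# Case 2
sort2 = resolve(columns, exclude=["C6"], default="INCLUDE")
dict2 = resolve(sort2, exclude=["C2"], default="INCLUDE")

# Because property values are always column names, a column that happens
# to be called "ALL" is handled like any other column.
odd = resolve(["ALL", "C1"], include=["ALL"], default="EXCLUDE")

print(sort1, dict1)
print(sort2, dict2)
print(odd)
```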



ravipesala wrote
Hi Bill,

I got your point, but making no-dictionary the default may
not be the perfect solution. Basically, no-dictionary columns are only meant for
high-cardinality dimensions, so the usage may change from user to user or
scenario to scenario.
This is basically a DDL usability issue, so please focus first on
simplifying DDL usability.

For example, if we have 6 columns, we can write the DDL as below.
case 1 :
SORT_COLUMNS="C1,C2,C3"
NON_SORT_COLUMNS="C4,C5,C6"
In the above case C1, C2, C3 are sort columns and part of the MDK key, and
C4, C5, C6 become non-sort columns (measure/complex).

DICTIONARY_EXCLUDE= 'ALL'
DICTIONARY_INCLUDE='C3'
In the above case all sort columns (C1,C2,C3) are no-dictionary columns
except C3; here C3 is a dictionary column.

case 2:
SORT_COLUMNS="ALL"
NON_SORT_COLUMNS="C6"
In this case all columns are sort columns except C6.

DICTIONARY_EXCLUDE= 'C2'
DICTIONARY_INCLUDE='ALL'
In the above case all sort columns(C1,C2,C3,C4,C5) are dictionary columns
except C2, here C2 is no-dictionary column.

Above mentioned are just my idea of how to simplify DDL to handle all
scenarios. We can have more discussion towards it to simplify the DDL.

Regards,
Ravindra.

On 27 February 2017 at 12:38, bill.zhou <[hidden email]> wrote:

> Dear Vishal & Ravindra
>
>   Thanks for your reply. I think I didn't describe it clearly, so you
> didn't get the full idea.
> 1. Dictionary is an important feature in CarbonData, and we introduce it to
> every new customer. So a new customer will understand it clearly and will
> set the dictionary columns when creating a table.
> 2. All customers, such as bank, telecom and traffic customers, share the
> same scenario: they have many columns but set only a few as dictionary.
>     For a telecom customer, only 5 of 300 columns are set as dictionary;
> the other dimensions are not.
>     For a bank customer, only about 5 of 100 columns are set as dictionary;
> the other dimensions are not.
> *In current customers' actual usage, they only set the dimensions used for
> filter and group by as dictionary*
> 3. My suggestion is that the no-dictionary default applies only to the
> dimensions that are not put into the dictionary_include property, not to
> all dimension columns. If the customer always adds the 5 columns they use
> to dictionary_include and leaves the other columns as no dictionary, this
> will not impact query performance.
>
> So I suggest that dimension columns not added to the dictionary_include
> property default to no dictionary.
>
> Regards
> Bill
>
>
>
> kumarvishal09 wrote
> > Hi,
> >     I completely agree with Ravindra's points. A larger number of
> > no-dictionary columns will impact both IO reading and writing, because
> > the data size increases for no-dictionary columns. Late decoding is one
> > of the main advantages of dictionary, so aggregation on no-dictionary
> > columns will be slower. Filter queries will suffer too: for a dictionary
> > column we compare on the byte-packed value, whereas for a no-dictionary
> > column we compare on the actual value.
> >
> > -Regards
> > Kumar Vishal
> >
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
> dimension-default-should-be-no-dictionary-tp8010p8027.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>



--
Thanks & Regards,
Ravi

Re: [DISCUSS] For the dimension default should be no dictionary

ravipesala
In reply to this post by Liang Chen
Hi Likun,

You mentioned that if the user does not specify dictionary columns, then by
default those columns are treated as no-dictionary columns.
But as I mentioned in my earlier mail, keeping no dictionary as the default
has many disadvantages. We initially introduced no-dictionary columns
to handle high-cardinality dimensions, but making everything a
no-dictionary column by default loses our unique feature compared to Parquet.
Dictionary columns were introduced not only for aggregation queries but
also for better compression and better filter queries. Without a
dictionary, the store size will increase a lot.

Regards,
Ravindra.

On 28 February 2017 at 18:05, Liang Chen <[hidden email]> wrote:

> Hi
>
> A couple of questions:
>
> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
> index" built only for the columns specified in the option (SORT_KEY)?
>
> 2) If users don't specify TABLE_DICTIONARY, then no columns use
> dictionary encoding, and all shuffle operations are based on the actual
> values. Is my understanding right?
> ------------------------------------------------------------
> If this option is not specified by the user, all columns are encoded without
> global dictionary support. A normal shuffle on decoded values will be applied
> when doing a group-by operation.
>
> 3) After introducing the two options SORT_KEY and TABLE_DICTIONARY,
> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
> How does the system handle this case?
> ------------------------------------------------------------
> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and are
> encoded with an inverted index and a minmax index.
>
> Regards
> Liang
>
> 2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:
>
> > Yes, first we should simplify the DDL options. I propose the following
> > options; please check whether they miss any scenario.
> >
> > 1. SORT_COLUMNS, or SORT_KEY
> > This indicates three things:
> > 1) All columns specified in this option will be used to construct the
> > Multi-Dimensional Key (MDK), and the data will be sorted along this key.
> > 2) They will be encoded with an inverted index and thus sorted again
> > within the column chunk in one blocklet.
> > 3) A minmax index will also be created for these columns.
> >
> > When to use: This option is designed for accelerating filter queries, so
> > put all filter columns into this option. The order can be:
> > 1) From low cardinality to high cardinality. This gives the most
> > compression and fits scenarios that do not frequently filter on
> > high-cardinality columns.
> > 2) High-cardinality columns first, then the others. This fits frequent
> > filtering on high-cardinality columns.
> >
> > For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and
> > are encoded with an inverted index and a minmax index.
> > Note that C1, C2, C3 can be dimensions, but they can also be measures. So
> > if the user needs to filter on a measure column, it can be put in the
> > SORT_COLUMNS option.
> >
> > If this option is not specified by the user, carbon will pick the MDK as
> > it does now.
> >
> > 2. TABLE_DICTIONARY
> > This specifies the table-level dictionary columns. A global dictionary
> > will be created for all columns in this option for every data load.
> >
> > When to use: This option is designed for accelerating aggregate queries,
> > so put group-by columns into this option.
> >
> > For example, TABLE_DICTIONARY="C2,C3,C5"
> >
> > If this option is not specified by the user, all columns are encoded
> > without global dictionary support. A normal shuffle on decoded values
> > will be applied when doing a group-by operation.
> >
> > I think these two options should be the basic options for normal users;
> > the goal is to satisfy most scenarios without deep tuning of the table.
> > For advanced users who want to do deep tuning, we can debate adding more
> > options. But we first need to identify which scenarios are not satisfied
> > by these two options.
> >
> > Regards,
> > Jacky
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
> > dimension-default-should-be-no-dictionary-tp8010p8081.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>
>
>
> --
> Regards
> Liang
>



--
Thanks & Regards,
Ravi

Re: [DISCUSS] For the dimension default should be no dictionary

Jacky Li
In reply to this post by Liang Chen

> On 28 February 2017, at 8:35 PM, Liang Chen <[hidden email]> wrote:
>
> Hi
>
> A couple of questions:
>
> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
> index" built only for the columns specified in the option (SORT_KEY)?
>
Yes, the MDK index, inverted index, and minmax index are built for the columns in SORT_KEY.

> 2) If users don't specify TABLE_DICTIONARY, then no columns use
> dictionary encoding, and all shuffle operations are based on the actual
> values. Is my understanding right?
> ------------------------------------------------------------
> If this option is not specified by the user, all columns are encoded without
> global dictionary support. A normal shuffle on decoded values will be applied
> when doing a group-by operation.
>
Yes

> 3) After introducing the two options SORT_KEY and TABLE_DICTIONARY,
> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
> How does the system handle this case?
> ------------------------------------------------------------
> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and are
> encoded with an inverted index and a minmax index.
>
It is sorted using the original value.
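
As a rough illustration of the three answers above, the following Python sketch classifies each column from the two proposed options. It is not CarbonData code: `ColumnPlan`, `plan_columns` and `_parse` are hypothetical names, and the label used for a sort column that is also a dictionary column is an assumption, since the thread only states the behaviour for the non-dictionary case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnPlan:
    name: str
    in_sort_key: bool        # part of the MDK, with inverted index and minmax index
    global_dictionary: bool  # encoded with the table-level dictionary

    @property
    def sort_basis(self):
        """What the column is sorted on, or None if it is not a sort column."""
        if not self.in_sort_key:
            return None
        # Assumption: a dictionary column in the sort key sorts on its encoded
        # value; the thread only confirms the "original value" case.
        return "dictionary-encoded value" if self.global_dictionary else "original value"


def _parse(option):
    """Split a comma-separated option such as "C1,C2,C3" into a set of names."""
    return {name.strip() for name in option.split(",") if name.strip()}


def plan_columns(columns, sort_columns="", table_dictionary=""):
    """Map each column name to its plan under the two proposed options."""
    sort_set, dict_set = _parse(sort_columns), _parse(table_dictionary)
    return {c: ColumnPlan(c, c in sort_set, c in dict_set) for c in columns}


# C2 is in SORT_COLUMNS but not in TABLE_DICTIONARY, as in question 3.
plans = plan_columns(["C1", "C2", "C3", "C4", "C5", "C6"],
                     sort_columns="C1,C2,C3",
                     table_dictionary="C3,C5")
```

Here `plans["C2"].sort_basis` is "original value", and leaving `table_dictionary` empty gives no dictionary columns at all, which matches the answer to question 2.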

> Regards
> Liang
>




Re: [DISCUSS] For the dimension default should be no dictionary

Jacky Li
In reply to this post by ravipesala
Yes, I agree with your point. My only concern is loading: I have seen many users accidentally put a high-cardinality column into the dictionary columns, and then the load failed with out-of-memory errors or ran very slowly. I guess they either do not know to use DICTIONARY_EXCLUDE for these columns, or they do not have an easy way to identify the high-cardinality columns. I feel that preventing such misuse is important in order to encourage more users to use CarbonData.

Any suggestion on solving this issue?
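
To make the question concrete, one possible direction is to scan a sample of the input before loading and flag the columns whose distinct-value count exceeds a threshold, so they can be suggested for DICTIONARY_EXCLUDE. The Python sketch below is only an illustration under assumptions: the function name `suggest_dictionary_exclude` and the threshold `max_distinct` are hypothetical, and nothing in this thread says CarbonData works this way.

```python
def suggest_dictionary_exclude(rows, columns, max_distinct=1000):
    """Return the columns whose distinct-value count exceeds max_distinct.

    `rows` is an iterable of tuples in the same order as `columns`, for
    example a sample of the data to be loaded.
    """
    distinct = {c: set() for c in columns}
    for row in rows:
        for column, value in zip(columns, row):
            seen = distinct[column]
            # Stop collecting once the column is known to be over the
            # threshold, so memory use stays bounded per column.
            if len(seen) <= max_distinct:
                seen.add(value)
    return [c for c in columns if len(distinct[c]) > max_distinct]


# A unique id is high cardinality, a country code is not.
sample = [(f"user-{i}", "CN" if i % 2 else "IN") for i in range(50)]
high_card = suggest_dictionary_exclude(sample, ["user_id", "country"], max_distinct=10)
```

With this sample only `user_id` is flagged. A sample can underestimate cardinality, so the result is a hint for the user rather than a guarantee.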


Regards,
Likun


> On 28 February 2017, at 10:20 PM, Ravindra Pesala <[hidden email]> wrote:
>
> Hi Likun,
>
> You mentioned that if user does not specify dictionary columns then by
> default those are chosen as no dictionary columns.
> But we have many disadvantages as I mentioned in above mail if you keep no
> dictionary as default. We have initially introduced no dictionary columns
> to handle high cardinality dimensions, but now making every thing as no
> dictionary columns by default looses our unique feature compare to parquet.
> Dictionary columns are introduced not only for aggregation queries, it is
> for better compression and better filter queries as well. With out
> dictionary store size will be increased a lot.
>
> Regards,
> Ravindra.
>




Re: [DISCUSS] For the dimension default should be no dictionary

David CaiQiang
In reply to this post by Jacky Li
+1

It was not easy for users to understand the previous options.
The logic of these two options, SORT_COLUMNS and TABLE_DICTIONARY, is very clear.
I am implementing the SORT_COLUMNS option this way.

Best Regards
David Caiqiang

Re: [DISCUSS] For the dimension default should be no dictionary

ravipesala
In reply to this post by Jacky Li
Hi Likun,

The same would be true if we made all columns no-dictionary by default: it
would increase the store size and decrease performance, and poor
performance does not encourage more users either.

If we need to make no-dictionary columns the default, then we should first
focus on reducing the store size and improving filter queries on
no-dictionary columns. Memory usage is also higher when querying
no-dictionary columns.

Regards,
Ravindra.

On 1 March 2017 at 06:00, Jacky Li <[hidden email]> wrote:

> Yes, I agree to your point. The only concern I have is for loading, I have
> seen many users accidentally put high cardinality column into dictionary
> column then the loading failed because out of memory or loading very slow.
> I guess they just do not know to use DICTIONARY_EXCLUDE for these columns,
> or they do not have a easy way to identify the high card columns. I feel
> preventing such misusage is important in order to encourage more users to
> use carbondata.
>
> Any suggestion on solving this issue?
>
>
> Regards,
> Likun
>
>


--
Thanks & Regards,
Ravi

Re: [DISCUSS] For the dimension default should be no dictionary

kumarvishal09
Hi Jacky,
I agree with Ravindra's point: making columns no-dictionary by default
will increase the store size and impact IO. In addition, carbon currently
supports only the String data type for no-dictionary columns, so we cannot
make dimension columns no-dictionary by default.

-Regards
Kumar Vishal

On Wed, Mar 1, 2017 at 12:42 PM, Ravindra Pesala <[hidden email]>
wrote:

> Hi Likun,
>
> It would be same case if we use all non dictionary columns by default, it
> will increase the store size and decrease the performance so it is also
> does not encourage more users if performance is poor.
>
> If we need to make no-dictionary columns as default then we should first
> focus on reducing the store size and improve the filter queries on
> non-dictionary columns.Even memory usage is higher while querying the
> non-dictionary columns.
>
> Regards,
> Ravindra.
>
kumar vishal

Re: [DISCUSS] For the dimension default should be no dictionary

ravipesala
In reply to this post by ravipesala
Hi All,

In order to make no-dictionary columns the default, we should first improve
the storage and performance of these columns. I have sent another mail to
discuss the improvement points. Please comment on it.

Regards,
Ravindra

On 1 March 2017 at 10:12, Ravindra Pesala <[hidden email]> wrote:

> Hi Likun,
>
> It would be the same case if we use all no-dictionary columns by default: it
> will increase the store size and decrease the performance, so it also
> does not encourage more users if performance is poor.
>
> If we need to make no-dictionary columns the default, then we should first
> focus on reducing the store size and improving the filter queries on
> no-dictionary columns. Even memory usage is higher while querying the
> no-dictionary columns.
>
> Regards,
> Ravindra.
>
> On 1 March 2017 at 06:00, Jacky Li <[hidden email]> wrote:
>
>> Yes, I agree with your point. The only concern I have is about loading. I
>> have seen many users accidentally put a high cardinality column into the
>> dictionary columns, and then the loading failed because of out of memory or
>> was very slow. I guess they just do not know to use DICTIONARY_EXCLUDE for
>> these columns, or they do not have an easy way to identify the high
>> cardinality columns. I feel preventing such misuse is important in order to
>> encourage more users to use carbondata.
>>
>> Any suggestion on solving this issue?
>>
>>
>> Regards,
>> Likun
>>
>>
>> > On 28 February 2017, at 10:20 PM, Ravindra Pesala <[hidden email]> wrote:
>> >
>> > Hi Likun,
>> >
>> > You mentioned that if the user does not specify dictionary columns, then by
>> > default they are treated as no-dictionary columns.
>> > But there are many disadvantages, as I mentioned in the mail above, if you
>> > keep no dictionary as the default. We initially introduced no-dictionary
>> > columns to handle high cardinality dimensions, but making everything
>> > no-dictionary by default loses our unique feature compared to Parquet.
>> > Dictionary columns were introduced not only for aggregation queries but
>> > also for better compression and better filter queries. Without a
>> > dictionary the store size will increase a lot.
>> >
>> > Regards,
>> > Ravindra.
>> >
>> > On 28 February 2017 at 18:05, Liang Chen <[hidden email]> wrote:
>> >
>> >> Hi
>> >>
>> >> A couple of questions:
>> >>
>> >> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>> >> index" built only for the columns specified in the option (SORT_KEY)?
>> >>
>> >> 2) If users don't specify TABLE_DICTIONARY, then no column gets
>> >> dictionary encoding and all shuffle operations are based on the fact
>> >> value. Is my understanding right?
>> >> ------------------------------------------------------------
>> >> If this option is not specified by user, means all columns encoding
>> >> without global dictionary support. Normal shuffle on decoded value will be
>> >> applied when doing group by operation.
>> >>
>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>> >> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
>> >> How does the system handle this case?
>> >> ------------------------------------------------------------
>> >> For example, SORT_COLUMNS="C1,C2,C3", means C1,C2,C3 is MDK and
>> >> encoded as Inverted Index and with Minmax Index
>> >>
>> >> Regards
>> >> Liang
>> >>
>> >
>> >
>> > --
>> > Thanks & Regards,
>> > Ravi
>>
>>
>>
>>
>
>
> --
> Thanks & Regards,
> Ravi
>



--
Thanks & Regards,
Ravi

Re: [DISCUSS] For the dimension default should be no dictionary

bill.zhou
Hi All,
Here is my summary of this discussion.
1. To keep CarbonData compatible with older versions, keep DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, with no dictionary as the default. I do not suggest changing these properties to TABLE_DICTIONARY.
2. I suggest the sort column properties follow the same style as the dictionary properties, so the new properties should be SORT_INCLUDE and SORT_EXCLUDE, with no sort as the default.
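
As a rough illustration of the properties in point 1, here is a minimal Python sketch that assembles a CREATE TABLE statement with DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE. Only the two property names come from this thread. The table name, the column names and the helper build_create_table are hypothetical, and the STORED BY 'carbondata' / TBLPROPERTIES shape of the statement is an assumption about the DDL form, so check it against the CarbonData version in use.

```python
def build_create_table(table, columns, dictionary_include=(), dictionary_exclude=()):
    """Return a CREATE TABLE statement string (hypothetical helper).

    columns: list of (name, sql_type) pairs in schema order.
    dictionary_include / dictionary_exclude: column names for the
    DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE table properties.
    """
    names = {name for name, _ in columns}
    for col in list(dictionary_include) + list(dictionary_exclude):
        if col not in names:
            raise ValueError(f"unknown column: {col}")
    # A column cannot be both dictionary and no-dictionary.
    overlap = set(dictionary_include) & set(dictionary_exclude)
    if overlap:
        raise ValueError(f"columns in both lists: {sorted(overlap)}")

    props = []
    if dictionary_include:
        props.append(("DICTIONARY_INCLUDE", ",".join(dictionary_include)))
    if dictionary_exclude:
        props.append(("DICTIONARY_EXCLUDE", ",".join(dictionary_exclude)))

    col_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    # Assumed statement shape; not taken from this thread.
    ddl = f"CREATE TABLE {table} ({col_sql}) STORED BY 'carbondata'"
    if props:
        prop_sql = ", ".join(f"'{key}'='{value}'" for key, value in props)
        ddl += f" TBLPROPERTIES ({prop_sql})"
    return ddl


# Hypothetical table: one low cardinality and one high cardinality dimension.
ddl = build_create_table(
    "sales",
    [("country", "STRING"), ("device_id", "STRING"), ("amount", "INT")],
    dictionary_include=["country"],
    dictionary_exclude=["device_id"],
)
print(ddl)
```

Whichever default is chosen, only the shorter of the two lists has to be written out, which is why the choice of default matters for tables with hundreds of dimensions.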

Regards
Bill


Re: [DISCUSS] For the dimension default should be no dictionary

Jacky Li
Hi Bill,

1. I think Ravindra and Vishal’s point is valid: we should keep dictionary as the default until we have improved the performance of no-dictionary columns.
We are discussing this in another thread on the mailing list.

2. For sorting, the default should be carbon’s current behavior (automatically picking dimensions as the MDK according to the default rule). If the user specifies SORT_COLUMNS, then use it. I think SORT_EXCLUDE is not required.
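
The fallback in point 2 could be sketched roughly as below. This is only an illustration in Python: resolve_sort_columns is a hypothetical name, and because the actual default rule is not spelled out in this thread, the sketch stands in for it with "all dimensions in schema order", which is purely an assumption.

```python
def resolve_sort_columns(dimensions, sort_columns=None):
    """Pick the columns that form the MDK (illustrative only).

    dimensions: dimension column names in schema order.
    sort_columns: the user-specified SORT_COLUMNS list, or None if not given.
    """
    if sort_columns is not None:
        # The user specified SORT_COLUMNS, so use it as given.
        return list(sort_columns)
    # Assumed stand-in for the default rule: every dimension, schema order.
    return list(dimensions)


print(resolve_sort_columns(["c1", "c2", "c3"]))
print(resolve_sort_columns(["c1", "c2", "c3"], sort_columns=["c2"]))
```

With this shape there is no need for a SORT_EXCLUDE list: a column is left out of the sort key simply by not naming it in SORT_COLUMNS.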

Regards,
Jacky
