Hi all,
Currently, when a CarbonData table is created, any dimension that is not listed in the dictionary_exclude property is treated as a dictionary column by default. I think the default should be no dictionary.

For example, in a POC I did for one customer, the table has 300 columns, 200 of them dimensions, but only 5 columns are used for filtering. The customer only needs to set those 5 columns as dictionary columns and leave the other 195 as no-dictionary columns. Today, however, all 195 columns have to be listed in the dictionary_exclude property, which wastes time, makes the CREATE TABLE command huge, and also impacts load performance.

So I suggest that dimensions default to no dictionary. This would also make it easy for the customer to see which columns are dictionary columns, which is useful.
|
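To make the DDL-size concern concrete, here is a minimal Python sketch that renders the table property needed for the 200-dimension example under the current default and under the proposed one. The column names (`dim1` to `dim200`) and the helper `tblproperties` are hypothetical, and the `TBLPROPERTIES` rendering is only an assumed approximation of the CREATE TABLE syntax; the property names are the ones discussed in this post.

```python
def tblproperties(prop: str, columns: list[str]) -> str:
    """Render one table property roughly as it might appear in CREATE TABLE."""
    return f"TBLPROPERTIES ('{prop}'='{','.join(columns)}')"

# Hypothetical schema: 200 dimensions, of which only 5 are filter columns.
dimensions = [f"dim{i}" for i in range(1, 201)]
filter_columns = dimensions[:5]
other_columns = dimensions[5:]

# Current default (dictionary): the 195 other columns must be listed explicitly.
current = tblproperties("DICTIONARY_EXCLUDE", other_columns)

# Proposed default (no dictionary): only the 5 dictionary columns are listed.
proposed = tblproperties("DICTIONARY_INCLUDE", filter_columns)

print(len(other_columns), "columns listed,", len(current), "characters")
print(len(filter_columns), "columns listed:", proposed)
```

The sketch only compares the amount of DDL text; it says nothing about store size or query performance under either default.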
Hi,
I feel there are more disadvantages than advantages in this approach. In your current scenario you want to set dictionary only for the columns that are used as filters, but the use of dictionary is not limited to filters: it can also reduce the store size and improve aggregation queries. I think you should set no_inverted_index false on the non-filtered columns to reduce the store size and improve the performance.

If we make no dictionary the default, the user no longer needs to list those columns in the DDL but has to list the dictionary columns instead. If the user wants to set more dictionary columns, the same problem you mentioned arises again, so it does not solve the problem. I feel we should give more flexibility in our DDL to simplify these scenarios, and we should have more discussion on it.

Pros and cons of your suggestion:

Advantages:
1. Encoding and decoding of the dictionary could be avoided.

Disadvantages:
1. Store size will increase drastically.
2. IO will increase, so query performance will come down.
3. Aggregation query performance will suffer.

Regards,
Ravindra.

On 26 February 2017 at 20:04, bill.zhou <[hidden email]> wrote:
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

--
Thanks & Regards,
Ravi
|
Hi,
I completely agree with Ravindra's points. A larger number of no-dictionary columns will affect IO for both reading and writing, because the data size increases when there is no dictionary. Late decoding is one of the main advantages of dictionary columns, so aggregation on no-dictionary columns will be slower. Filter queries will also suffer: on a dictionary column we compare byte-packed values, whereas on a no-dictionary column the comparison is on the actual value.

Regards,
Kumar Vishal

On Mon, Feb 27, 2017 at 12:34 AM, Ravindra Pesala <[hidden email]> wrote:
> Disadvantages :
> 1. Store size will increase drastically.
> 2. IO will increase so query performance will come down.
> 3. Aggregation queries performance will suffer.
kumar vishal
|
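The store-size and filter arguments in the two replies above can be illustrated with a toy model. This is a sketch under stated assumptions, not CarbonData's actual encoding: the function `dictionary_encode`, the sample column, and the one-byte-per-key size estimate are all hypothetical, and real storage applies further compression on top.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer surrogate key."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    return dictionary, [dictionary[v] for v in values]

# Hypothetical low-cardinality column: 3 distinct cities over 9,000 rows.
column = ["Bangalore", "Shenzhen", "Hangzhou"] * 3000

dictionary, encoded = dictionary_encode(column)

# Rough size model: raw UTF-8 bytes versus one byte per surrogate key
# plus the bytes of the dictionary entries themselves.
raw_bytes = sum(len(v.encode("utf-8")) for v in column)
dict_bytes = len(encoded) + sum(len(v.encode("utf-8")) for v in dictionary)

# A filter on a dictionary column resolves the literal once,
# then compares small integers instead of full strings.
target = dictionary["Shenzhen"]
matches = sum(1 for key in encoded if key == target)

print(raw_bytes, dict_bytes, matches)
```

The picture changes for high-cardinality columns, where the dictionary itself grows with the data; that is the case no-dictionary columns were introduced for, as noted later in the thread.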
Dear Vishal & Ravindra
Thanks for your replies. I think I didn't describe it clearly, so you didn't get the full idea.

1. Dictionary is an important feature of CarbonData, and we introduce it to every new customer. So a new customer will understand it clearly and will set the dictionary columns when creating a table.
2. Customers such as bank, telecom, and traffic customers all share the same scenario: they have many columns but set only a few of them as dictionary columns.
   - A telecom customer with 300 columns sets only 5 columns as dictionary; the other dimensions are not set as dictionary.
   - A bank customer with 100 columns sets only about 5 columns as dictionary; the other dimensions are not set as dictionary.
   In current customers' actual usage, only the dimensions used for filter and group by are set as dictionary columns.
3. My suggestion is that the no-dictionary default applies only to the dimensions that are not listed in the dictionary_include property, not to all dimension columns. If a customer always puts 5 columns into dictionary_include and leaves the other columns as no dictionary, this will not impact query performance.

So I suggest that dimension columns that are not listed in the dictionary_include property default to no dictionary.

Regards
Bill
|
Hi Bill,
I get your point, but making no-dictionary the default may not be the ideal solution. No-dictionary columns are basically meant only for high-cardinality dimensions, so the usage may change from user to user and from scenario to scenario.
This is fundamentally a DDL usability issue, so please focus first on simplifying the DDL.

For example, if we have 6 columns, we could write the DDL as below.

Case 1:
SORT_COLUMNS="C1,C2,C3"
NON_SORT_COLUMNS="C4,C5,C6"
In this case C1, C2, and C3 are sort columns and part of the MDK key, and C4, C5, and C6 become non-sort columns (measure/complex).

DICTIONARY_EXCLUDE='ALL'
DICTIONARY_INCLUDE='C3'
In this case all sort columns (C1, C2, C3) are no-dictionary columns except C3, which is a dictionary column.

Case 2:
SORT_COLUMNS="ALL"
NON_SORT_COLUMNS="C6"
In this case all columns are sort columns except C6.

DICTIONARY_EXCLUDE='C2'
DICTIONARY_INCLUDE='ALL'
In this case all sort columns (C1, C2, C3, C4, C5) are dictionary columns except C2, which is a no-dictionary column.

These are just my ideas on how to simplify the DDL to handle all scenarios. We can discuss further how to simplify it.

Regards,
Ravindra.

On 27 February 2017 at 12:38, bill.zhou <[hidden email]> wrote:
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8027.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

--
Thanks & Regards,
Ravi
|
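The two cases in Ravindra's proposal can be checked with a small Python model. This is only a sketch of one possible reading of the proposal: the helpers `names` and `split_columns` are hypothetical, and the rule that the include list wins when neither side is 'ALL' is an assumption, since the thread does not define that situation.

```python
def names(spec):
    """Parse a comma-separated property value into a list of column names."""
    return [c.strip() for c in spec.split(",")]

def split_columns(columns, include, exclude):
    """Return (selected, rest) for one INCLUDE/EXCLUDE property pair.

    Assumed semantics: the keyword "ALL" on one side means "every column
    except those listed on the other side". If neither side is "ALL",
    the include list decides and everything else is treated as excluded.
    """
    if include == "ALL" and exclude == "ALL":
        raise ValueError("INCLUDE and EXCLUDE cannot both be 'ALL'")
    if include == "ALL":
        excluded = set(names(exclude))
        selected = [c for c in columns if c not in excluded]
    else:
        included = set(names(include))
        selected = [c for c in columns if c in included]
    rest = [c for c in columns if c not in selected]
    return selected, rest

table = ["C1", "C2", "C3", "C4", "C5", "C6"]

# Case 1: explicit sort columns, dictionary only for C3.
sort1, non_sort1 = split_columns(table, include="C1,C2,C3", exclude="C4,C5,C6")
dict1, no_dict1 = split_columns(sort1, include="C3", exclude="ALL")

# Case 2: all columns sorted except C6, dictionary for all sort columns except C2.
sort2, non_sort2 = split_columns(table, include="ALL", exclude="C6")
dict2, no_dict2 = split_columns(sort2, include="ALL", exclude="C2")

print(dict1, no_dict1)
print(dict2, no_dict2)
```

Note that the dictionary properties are resolved against the sort columns only, which matches how both cases are described in the post.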
Administrator
|
Hi
+1 for adding DICTIONARY_EXCLUDE='ALL' and DICTIONARY_INCLUDE='ALL' to improve the usability of the DDL. This solution is more flexible than making no-dictionary the default.

Regards
Liang

2017-02-27 20:27 GMT+08:00 Ravindra Pesala <[hidden email]>:
> This is the basic issue of usability of DDL, please first focus on to
> simplify DDL usability.

--
Regards
Liang
|
In reply to this post by ravipesala
Yes, first we should simplify the DDL options. I propose the following options; please check whether they miss any scenario.

1. SORT_COLUMNS, or SORT_KEY
This indicates three things:
   1) All columns specified in the option will be used to construct the Multi-Dimensional Key (MDK), and the data will be sorted along this key.
   2) They will be encoded with an inverted index and thus sorted again within the column chunk in one blocklet.
   3) A minmax index will also be created for these columns.

When to use: This option is designed for accelerating filter queries, so put all filter columns into it. The column order can be either:
   1) From low cardinality to high cardinality. This gives the most compression and fits scenarios that do not filter frequently on high-cardinality columns.
   2) High-cardinality columns first, then the others. This fits frequent filters on high-cardinality columns.

For example, SORT_COLUMNS="C1,C2,C3" means that C1, C2, C3 form the MDK, are encoded with an inverted index, and have a minmax index.
Note that C1, C2, C3 can be dimensions, but they can also be measures. So if the user needs to filter on a measure column, it can be put in the SORT_COLUMNS option.

If the user does not specify this option, carbon will pick the MDK as it does now.

2. TABLE_DICTIONARY
This specifies the table-level dictionary columns. A global dictionary will be created for all columns in this option on every data load.

When to use: This option is designed for accelerating aggregate queries, so put the group-by columns into it.

For example, TABLE_DICTIONARY="C2,C3,C5"

If the user does not specify this option, all columns are encoded without global dictionary support, and a normal shuffle on the decoded values is applied for group-by operations.

I think these two options should be the basic options for normal users. Their goal is to satisfy most scenarios without deep tuning of the table.
For advanced users who want to do deep tuning, we can debate adding more options, but first we need to identify which scenarios are not satisfied by these two options.

Regards,
Jacky
|
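How sorting along the key and a minmax index work together for filter queries can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not CarbonData's storage layout: the `Blocklet` class, the functions `build_blocklets` and `prune`, and the sample rows are all hypothetical, and the inverted index is left out.

```python
from dataclasses import dataclass

@dataclass
class Blocklet:
    rows: list       # rows in sort-key order
    min_key: tuple   # sort key of the first row
    max_key: tuple   # sort key of the last row

def build_blocklets(rows, sort_columns, size):
    """Sort rows along the multi-column key and cut them into fixed-size blocklets."""
    def key(row):
        return tuple(row[c] for c in sort_columns)
    ordered = sorted(rows, key=key)
    blocklets = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        blocklets.append(Blocklet(chunk, key(chunk[0]), key(chunk[-1])))
    return blocklets

def prune(blocklets, value):
    """Keep only blocklets whose min/max range on the leading sort column can contain value."""
    return [b for b in blocklets if b.min_key[0] <= value <= b.max_key[0]]

# Hypothetical data: C1 has low cardinality (4 values), C2 has high cardinality.
rows = [{"C1": i % 4, "C2": i} for i in range(16)]
blocklets = build_blocklets(rows, sort_columns=["C1", "C2"], size=4)

# A filter C1 = 2 only has to read the blocklets that survive pruning.
hits = prune(blocklets, 2)
print(len(blocklets), len(hits))
```

With the low-cardinality column leading the key, equal values end up adjacent, which is also what makes the data compress well; pruning on later key columns is less effective because their ranges overlap across blocklets.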
Administrator
|
Hi
A couple of questions:

1) For the SORT_KEY option: are the MDK index, inverted index, and minmax index built only for the columns specified in SORT_KEY?

2) If users don't specify TABLE_DICTIONARY, then no column is dictionary encoded and all shuffle operations are based on the fact values. Is my understanding right?
> If this option is not specified by user, means all columns encoding without global dictionary support. Normal shuffle on decoded value will be applied when doing group by operation.

3) After introducing the two options SORT_KEY and TABLE_DICTIONARY, suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY. How does the system handle this case?
> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as Inverted Index and with Minmax Index

Regards
Liang

2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.

--
Regards
Liang
|
In reply to this post by ravipesala
Hi Ravindra,
It is a good idea to consider the sort columns and the dictionary columns together. For DDL usability I have the following suggestions; please share your thoughts.

1. The sort column properties should keep the same style as the dictionary ones, so I suggest changing the keywords to SORT_INCLUDE and SORT_EXCLUDE.
2. Users may be confused if DICTIONARY_EXCLUDE='ALL' and DICTIONARY_INCLUDE='C3' appear together.
3. The values of the sort and dictionary properties should only be column names. If DICTIONARY_EXCLUDE='ALL' is allowed, "ALL" may conflict with an actual table column name.

So I think the key point is how to treat the default, that is, the columns that are not set in INCLUDE or EXCLUDE. For the end user, putting a column in INCLUDE or EXCLUDE means that the column is important and of concern to them.

So my suggestion is to add one more property called xxx_DEFAULT.

For example, if we have 6 columns, we could write the DDL as below.

Case 1:
SORT_INCLUDE="C1,C2,C3"
SORT_EXCLUDE="C4,C5,C6"
In this case C1, C2, and C3 are sort columns and part of the MDK key, and C4, C5, and C6 become non-sort columns (measure/complex).

DICTIONARY_DEFAULT='EXCLUDE'
DICTIONARY_INCLUDE='C3'
In this case all sort columns (C1, C2, C3) are no-dictionary columns except C3, which is a dictionary column.

Case 2:
SORT_DEFAULT="INCLUDE"
SORT_EXCLUDE="C6"
In this case all columns are sort columns except C6.

DICTIONARY_EXCLUDE='C2'
DICTIONARY_DEFAULT='INCLUDE'
In this case all sort columns (C1, C2, C3, C4, C5) are dictionary columns except C2, which is a no-dictionary column.
|
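Bill's xxx_DEFAULT idea can be modelled in Python as follows. This is a sketch of one possible interpretation: the function `resolve_with_default` and its validation rules (rejecting a column listed on both sides, or a name that is not in the table) are assumptions, not behavior described in the thread.

```python
def resolve_with_default(columns, default, include=(), exclude=()):
    """Return the columns that end up included.

    default is "INCLUDE" or "EXCLUDE" and applies only to columns that
    appear in neither list. The lists contain column names only, so no
    keyword such as "ALL" can clash with a real column name.
    """
    include, exclude = set(include), set(exclude)
    if default not in ("INCLUDE", "EXCLUDE"):
        raise ValueError("default must be 'INCLUDE' or 'EXCLUDE'")
    if include & exclude:
        raise ValueError(f"listed on both sides: {sorted(include & exclude)}")
    unknown = (include | exclude) - set(columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    return [
        c for c in columns
        if c in include or (c not in exclude and default == "INCLUDE")
    ]

table = ["C1", "C2", "C3", "C4", "C5", "C6"]

# Case 1: SORT_INCLUDE/SORT_EXCLUDE cover every column; DICTIONARY_DEFAULT='EXCLUDE'.
sort1 = resolve_with_default(table, "EXCLUDE",
                             include=["C1", "C2", "C3"], exclude=["C4", "C5", "C6"])
dict1 = resolve_with_default(sort1, "EXCLUDE", include=["C3"])

# Case 2: SORT_DEFAULT='INCLUDE' with SORT_EXCLUDE='C6'; DICTIONARY_DEFAULT='INCLUDE'.
sort2 = resolve_with_default(table, "INCLUDE", exclude=["C6"])
dict2 = resolve_with_default(sort2, "INCLUDE", exclude=["C2"])

print(dict1)
print(dict2)
```

In case 1 the sort properties list every column, so the value passed as the sort default has no effect there.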
In reply to this post by Liang Chen
Hi Likun,
You mentioned that if the user does not specify dictionary columns, they are treated as no-dictionary columns by default. But as I mentioned in my earlier mail, keeping no dictionary as the default has many disadvantages. We initially introduced no-dictionary columns to handle high-cardinality dimensions; making everything a no-dictionary column by default would lose our unique feature compared to Parquet. Dictionary columns were introduced not only for aggregation queries but also for better compression and better filter queries. Without dictionary, the store size will increase a lot.

Regards,
Ravindra.

On 28 February 2017 at 18:05, Liang Chen <[hidden email]> wrote:
> 2) If users don't specify TABLE_DICTIONARY, then all columns don't make
> dictionary encoding, and all shuffle operations are based on fact value, is
> my understanding right ?

--
Thanks & Regards,
Ravi
|
In reply to this post by Liang Chen
> On 28 February 2017, at 8:35 PM, Liang Chen <[hidden email]> wrote:
>
> Hi
>
> A couple of questions:
>
> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax index" built only for the columns specified in the option (SORT_KEY)?

Yes, the MDK index, inverted index, and minmax index are built for the columns in SORT_KEY.

> 2) If users don't specify TABLE_DICTIONARY, then no column gets dictionary encoding, and all shuffle operations are based on the fact value. Is my understanding right?
> ------------------------------------------------------------
> If this option is not specified by user, means all columns encoding without global dictionary support. Normal shuffle on decoded value will be applied when doing group by operation.

Yes.

> 3) After introducing the two options SORT_KEY and TABLE_DICTIONARY, suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY. How does the system handle this case?
> ------------------------------------------------------------
> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as Inverted Index and with Minmax Index

It is sorted using the original value.

> Regards
> Liang
>
> 2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:
>
>> Yes, first we should simplify the DDL options. I propose the following options; please check whether they miss any scenario.
>>
>> 1. SORT_COLUMNS, or SORT_KEY
>> This indicates three things:
>> 1) All columns specified in the option will be used to construct the Multi-Dimensional Key (MDK), and the data will be sorted along this key
>> 2) They will be encoded as Inverted Index and thus sorted again within the column chunk in one blocklet
>> 3) A Minmax index will also be created for these columns
>>
>> When to use: This option is designed for accelerating filter queries, so put all filter columns into this option. The order can be:
>> 1) From low cardinality to high cardinality. This gives the most compression and fits scenarios that do not filter frequently on a high-cardinality column
>> 2) Put the high-cardinality column first, then the others. This fits frequent filtering on a high-cardinality column
>>
>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and are encoded as Inverted Index with Minmax Index.
>> Note that C1, C2, C3 can be dimensions, but they can also be measures. So if the user needs to filter on a measure column, it can be put in the SORT_COLUMNS option.
>>
>> If this option is not specified by the user, Carbon will pick the MDK as it does now.
>>
>> 2. TABLE_DICTIONARY
>> This specifies the table-level dictionary columns. A global dictionary will be created for all columns in this option on every data load.
>>
>> When to use: This option is designed for accelerating aggregate queries, so put the group-by columns into this option.
>>
>> For example: TABLE_DICTIONARY=“C2,C3,C5”
>>
>> If this option is not specified by the user, all columns are encoded without global dictionary support. A normal shuffle on the decoded value will be applied when doing a group-by operation.
>>
>> I think these two options should be the basic options for normal users; their goal is to satisfy most scenarios without deep tuning of the table.
>> For advanced users who want to do deep tuning, we can debate adding more options. But we need to identify what scenario is not satisfied by these two options first.
>>
>> Regards,
>> Jacky
>
> --
> Regards
> Liang |
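To make the effects of SORT_COLUMNS described in the proposal above more concrete, here is a minimal Python sketch. It is not CarbonData code. It sorts rows along a multi-column key, cuts them into fixed-size blocklets, records a min/max range per sort column, and uses those ranges to skip blocklets for an equality filter. The names `build_blocklets` and `prune`, the blocklet size, and the integer-valued row model are assumptions made for illustration only, and the inverted index is left out.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, int]


def build_blocklets(rows: Sequence[Row],
                    sort_columns: Sequence[str],
                    blocklet_size: int) -> List[dict]:
    """Sort rows by the multi-column key, then record min/max per blocklet."""
    # Step 1: sort along the multi-dimensional key (the listed columns, in order).
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    blocklets = []
    for start in range(0, len(ordered), blocklet_size):
        chunk = ordered[start:start + blocklet_size]
        # Step 2: keep a (min, max) pair for every sort column of this chunk.
        minmax: Dict[str, Tuple[int, int]] = {
            c: (min(r[c] for r in chunk), max(r[c] for r in chunk))
            for c in sort_columns
        }
        blocklets.append({"rows": chunk, "minmax": minmax})
    return blocklets


def prune(blocklets: List[dict], column: str, value: int) -> List[dict]:
    """Keep only blocklets whose min/max range can contain `value`."""
    return [b for b in blocklets
            if b["minmax"][column][0] <= value <= b["minmax"][column][1]]


# Hypothetical data: C1 has low cardinality, C3 has high cardinality.
rows = [{"C1": i % 4, "C2": i % 7, "C3": i} for i in range(16)]
blocklets = build_blocklets(rows, ["C1", "C2", "C3"], blocklet_size=4)
candidates = prune(blocklets, "C1", 2)
print(len(blocklets), len(candidates))
```

Because the low-cardinality column leads the key, equal C1 values end up adjacent and a filter on C1 only has to read a fraction of the blocklets. Putting a different column first changes which filters benefit, which is the ordering trade-off the proposal describes.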
In reply to this post by ravipesala
Yes, I agree with your point. My only concern is loading: I have seen many users accidentally configure a high-cardinality column as a dictionary column, after which the load either failed with an out-of-memory error or became very slow. I guess they either do not know to use DICTIONARY_EXCLUDE for these columns, or they have no easy way to identify the high-cardinality columns. I feel preventing such misuse is important in order to encourage more users to use CarbonData.
Any suggestions on solving this issue?

Regards,
Likun

> On 28 February 2017, at 10:20 PM, Ravindra Pesala <[hidden email]> wrote:
>
> Hi Likun,
>
> You mentioned that if the user does not specify dictionary columns, then by default they are treated as no-dictionary columns.
> But, as I mentioned in my earlier mail, keeping no dictionary as the default has many disadvantages. We initially introduced no-dictionary columns to handle high-cardinality dimensions; making everything a no-dictionary column by default loses our unique feature compared to Parquet.
> Dictionary columns were introduced not only for aggregation queries but also for better compression and better filter queries. Without a dictionary, the store size will increase a lot.
>
> Regards,
> Ravindra. |
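One way to approach the problem raised above, where users cannot easily tell which columns are high cardinality, is to count distinct values on a sample before writing the DDL. The following Python sketch is only an illustration under assumptions: the helper `suggest_dictionary_exclude`, the threshold of 100, and the sample data are hypothetical and not part of CarbonData. Only the property name DICTIONARY_EXCLUDE comes from this thread.

```python
from typing import Dict, List, Mapping, Sequence, Set


def suggest_dictionary_exclude(rows: Sequence[Mapping[str, str]],
                               columns: Sequence[str],
                               max_cardinality: int) -> List[str]:
    """Return columns whose distinct-value count in the sample exceeds the limit."""
    distinct: Dict[str, Set[str]] = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            distinct[c].add(row[c])
    return [c for c in columns if len(distinct[c]) > max_cardinality]


# Hypothetical sample: "country" repeats often, "session_id" is unique per row.
sample = [{"country": "C%d" % (i % 5), "session_id": "s%d" % i}
          for i in range(1000)]

exclude = suggest_dictionary_exclude(sample, ["country", "session_id"],
                                     max_cardinality=100)
# Render the result as a table property fragment for the CREATE TABLE statement.
ddl_property = "'DICTIONARY_EXCLUDE'='%s'" % ",".join(exclude)
print(ddl_property)
```

A sample can underestimate cardinality, so the threshold should be chosen conservatively. The point is only that the decision can be computed rather than guessed.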
In reply to this post by Jacky Li
+1
It is not easy for users to understand the previous options. The logic of the two options SORT_COLUMNS and TABLE_DICTIONARY is very clear. I am implementing the SORT_COLUMNS option in this way. Best Regards David Caiqiang
Best Regards
David Cai |
In reply to this post by Jacky Li
Hi Likun,
It would be the same case if we made all columns no-dictionary by default: it would increase the store size and decrease performance, and poor performance does not encourage more users either.

If we need to make no-dictionary columns the default, then we should first focus on reducing the store size and improving filter queries on non-dictionary columns. Even memory usage is higher when querying non-dictionary columns.

Regards,
Ravindra.

--
Thanks & Regards,
Ravi |
Hi Jacky,
I agree with Ravindra's point: making columns no-dictionary by default will increase the store size and impact IO. In addition, Carbon currently supports only the String data type for no-dictionary columns, so we cannot make dimension columns no-dictionary by default.

-Regards
Kumar Vishal
kumar vishal
|
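The store-size argument in the messages above, and the earlier high-cardinality concern, can both be seen in a toy size model. The Python sketch below is an assumption-laden illustration, not CarbonData's on-disk format: it compares the bytes needed to store raw UTF-8 strings with the bytes needed for a dictionary plus fixed-width bit-packed codes. The function names and the sample columns are hypothetical.

```python
from typing import Sequence


def raw_size(values: Sequence[str]) -> int:
    """Bytes needed to store every value as a plain UTF-8 string."""
    return sum(len(v.encode("utf-8")) for v in values)


def dictionary_size(values: Sequence[str]) -> int:
    """Bytes for one copy of each distinct value plus bit-packed integer codes."""
    dictionary = sorted(set(values))
    # Bits needed to address every dictionary entry (at least one bit).
    bits_per_code = max(1, (len(dictionary) - 1).bit_length())
    code_bytes = (bits_per_code * len(values) + 7) // 8
    dictionary_bytes = sum(len(v.encode("utf-8")) for v in dictionary)
    return dictionary_bytes + code_bytes


# Hypothetical columns: 8 distinct cities versus 10,000 distinct user ids.
low_cardinality = ["city_%d" % (i % 8) for i in range(10000)]
high_cardinality = ["user_%d" % i for i in range(10000)]

for name, column in (("low", low_cardinality), ("high", high_cardinality)):
    print(name, raw_size(column), dictionary_size(column))
```

In this model the low-cardinality column shrinks sharply with a dictionary, while the all-distinct column gets larger because the dictionary repeats every value once and the codes come on top. The model ignores general-purpose compression and the memory cost of building a global dictionary during load.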
In reply to this post by ravipesala
Hi All,
In order to make no-dictionary columns the default, we should first improve the storage and performance for these columns. I have sent another mail to discuss the improvement points. Please comment on it.

Regards,
Ravindra

--
Thanks & Regards,
Ravi |
hi All
Let me summarize this discussion.

1. To keep CarbonData compatible with older versions, keep DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, with no dictionary as the default. I do not suggest changing these properties to TABLE_DICTIONARY.
2. I suggest the sort column properties follow the same style as the dictionary ones, so the new properties would be SORT_INCLUDE and SORT_EXCLUDE, with no sort as the default.

Regards
Bill
|
Hi Bill,
1. I think Ravindra and Vishal’s point is valid, we should keep default is dictionary before we have improved performance of no-dictionary column. We are discussing this in another thread in mail list. 2. For sorting, default should be carbon’s current behavior (picking dimension according to default rule automatically as the MDK). If user specify SORT_COLUMNS, then use it. I think SORT_EXCLUDE is not required. Regards, Jacky > 在 2017年3月3日,上午12:22,bill.zhou <[hidden email]> 写道: > > hi All > I summary this discussion. > 1. to make carbonData compatibility for older vesion, keep > DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, default is no dictionary. do not > suggestion change this properties to table_dictionary. > 2. Suggestion keep the sort_column properties as the same style for > dictionary. so this new properties suggestion use SORT_INCLUDE and > SORT_EXCLUDE, default is no sort. > > Regards > Bill > > > ravipesala wrote >> Hi All, >> >> In order to make no-dictionary columns as default we should improve the >> storage and performance for these columns. I have sent another mail to >> discuss the improvement points. Please comment on it. >> >> Regards, >> Ravindra >> >> On 1 March 2017 at 10:12, Ravindra Pesala < > >> ravi.pesala@ > >> > wrote: >> >>> Hi Likun, >>> >>> It would be same case if we use all non dictionary columns by default, it >>> will increase the store size and decrease the performance so it is also >>> does not encourage more users if performance is poor. >>> >>> If we need to make no-dictionary columns as default then we should first >>> focus on reducing the store size and improve the filter queries on >>> non-dictionary columns.Even memory usage is higher while querying the >>> non-dictionary columns. >>> >>> Regards, >>> Ravindra. >>> >>> On 1 March 2017 at 06:00, Jacky Li < > >> jacky.likun@ > >> > wrote: >>> >>>> Yes, I agree to your point. 
The only concern I have is for loading, I >>>> have seen many users accidentally put high cardinality column into >>>> dictionary column then the loading failed because out of memory or >>>> loading >>>> very slow. I guess they just do not know to use DICTIONARY_EXCLUDE for >>>> these columns, or they do not have a easy way to identify the high card >>>> columns. I feel preventing such misusage is important in order to >>>> encourage >>>> more users to use carbondata. >>>> >>>> Any suggestion on solving this issue? >>>> >>>> >>>> Regards, >>>> Likun >>>> >>>> >>>>> 在 2017年2月28日,下午10:20,Ravindra Pesala < > >> ravi.pesala@ > >> > 写道: >>>>> >>>>> Hi Likun, >>>>> >>>>> You mentioned that if user does not specify dictionary columns then by >>>>> default those are chosen as no dictionary columns. >>>>> But we have many disadvantages as I mentioned in above mail if you >>>> keep >>>> no >>>>> dictionary as default. We have initially introduced no dictionary >>>> columns >>>>> to handle high cardinality dimensions, but now making every thing as >>>> no >>>>> dictionary columns by default looses our unique feature compare to >>>> parquet. >>>>> Dictionary columns are introduced not only for aggregation queries, it >>>> is >>>>> for better compression and better filter queries as well. With out >>>>> dictionary store size will be increased a lot. >>>>> >>>>> Regards, >>>>> Ravindra. >>>>> >>>>> On 28 February 2017 at 18:05, Liang Chen < > >> chenliang6136@ > >> > >>>> wrote: >>>>> >>>>>> Hi >>>>>> >>>>>> A couple of questions: >>>>>> >>>>>> 1) For SORT_KEY option: only build "MDK index, inverted index, minmax >>>>>> index" for these columns which be specified into the option(SORT_KEY) >>>> ? >>>>>> >>>>>> 2) If users don't specify TABLE_DICTIONARY, then all columns don't >>>> make >>>>>> dictionary encoding, and all shuffle operations are based on fact >>>> value, is >>>>>> my understanding right ? 
>>>>>> ------------------------------------------------------------
>>>>>> If this option is not specified by the user, all columns are encoded
>>>>>> without global dictionary support. A normal shuffle on decoded values
>>>>>> will be applied when doing a group-by operation.
>>>>>>
>>>>>> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>>>>> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
>>>>>> How does the system handle this case?
>>>>>> ------------------------------------------------------------
>>>>>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK,
>>>>>> are encoded as Inverted Index, and have a Minmax Index.
>>>>>>
>>>>>> Regards
>>>>>> Liang
>>>>>>
>>>>>> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.likun@>:
>>>>>>
>>>>>>> Yes, first we should simplify the DDL options. I propose the
>>>>>>> following options; please check whether they miss any scenario.
>>>>>>>
>>>>>>> 1. SORT_COLUMNS, or SORT_KEY
>>>>>>> This indicates three things:
>>>>>>> 1) All columns specified in the option will be used to construct the
>>>>>>> Multi-Dimensional Key (MDK), and data will be sorted along this key
>>>>>>> 2) They will be encoded as Inverted Index and thus sorted again
>>>>>>> within the column chunk in one blocklet
>>>>>>> 3) A Minmax index will also be created for these columns
>>>>>>>
>>>>>>> When to use: This option is designed for accelerating filter
>>>>>>> queries, so put all filter columns into this option. The order can be:
>>>>>>> 1) From low cardinality to high cardinality. This gives the most
>>>>>>> compression and fits scenarios that do not filter frequently on
>>>>>>> high-cardinality columns
>>>>>>> 2) Put the high-cardinality column first, then the others. This fits
>>>>>>> frequent filtering on the high-cardinality column
>>>>>>>
>>>>>>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK,
>>>>>>> are encoded as Inverted Index, and have a Minmax Index.
>>>>>>> Note that C1, C2, C3 can be dimensions, but they can also be
>>>>>>> measures. So if the user needs to filter on a measure column, it can
>>>>>>> be put in the SORT_COLUMNS option.
>>>>>>>
>>>>>>> If this option is not specified by the user, carbon will pick the
>>>>>>> MDK as it does now.
>>>>>>>
>>>>>>> 2. TABLE_DICTIONARY
>>>>>>> This specifies the table-level dictionary columns. A global
>>>>>>> dictionary will be created for all columns in this option on every
>>>>>>> data load.
>>>>>>>
>>>>>>> When to use: This option is designed for accelerating aggregate
>>>>>>> queries, so put group-by columns into this option.
>>>>>>>
>>>>>>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>>>>>>>
>>>>>>> If this option is not specified by the user, all columns are encoded
>>>>>>> without global dictionary support. A normal shuffle on decoded
>>>>>>> values will be applied when doing a group-by operation.
>>>>>>>
>>>>>>> I think these two options should be the basic options for normal
>>>>>>> users; their goal is to satisfy most scenarios without deep tuning
>>>>>>> of the table.
>>>>>>> For advanced users who want to do deep tuning, we can debate adding
>>>>>>> more options. But first we need to identify which scenarios are not
>>>>>>> satisfied by these two options.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jacky
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
>>>>>>> Sent from the Apache CarbonData Mailing List archive mailing list
>>>>>>> archive at Nabble.com.
>>>>>> --
>>>>>> Regards
>>>>>> Liang
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Ravi
>>>
>>> --
>>> Thanks & Regards,
>>> Ravi
>>
>> --
>> Thanks & Regards,
>> Ravi
>
> --
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8198.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com. |
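On the misuse concern raised in the thread (high-cardinality columns ending up dictionary-encoded, and users having no easy way to spot them), here is a rough Python sketch of how candidates could be identified from a data sample. It also orders filter columns from low to high cardinality, the first ordering in the SORT_COLUMNS proposal. The function names, the sample rows, and the threshold are assumptions made up for illustration. This is not CarbonData code, and the printed strings only mimic the option names used in the thread.

```python
from collections import defaultdict


def column_cardinalities(rows):
    """Count distinct values per column over a sample of rows (dicts)."""
    distinct = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            distinct[column].add(value)
    return {column: len(values) for column, values in distinct.items()}


def split_by_cardinality(cardinalities, threshold):
    """Return (dictionary_columns, no_dictionary_columns).

    Columns with more than `threshold` distinct values are treated as
    high cardinality. The threshold is a hypothetical tuning knob.
    """
    low = sorted(c for c, n in cardinalities.items() if n <= threshold)
    high = sorted(c for c, n in cardinalities.items() if n > threshold)
    return low, high


def order_sort_columns(columns, cardinalities):
    """Order filter columns from low to high cardinality."""
    return sorted(columns, key=lambda c: (cardinalities[c], c))


# Hypothetical sample data: two low-cardinality columns and one ID column.
sample = [
    {"country": "CN", "gender": "M", "user_id": "u1"},
    {"country": "CN", "gender": "F", "user_id": "u2"},
    {"country": "IN", "gender": "M", "user_id": "u3"},
    {"country": "IN", "gender": "F", "user_id": "u4"},
]

cards = column_cardinalities(sample)
dict_cols, no_dict_cols = split_by_cardinality(cards, threshold=2)

print("DICTIONARY_EXCLUDE=" + ",".join(no_dict_cols))
print("SORT_COLUMNS=" + ",".join(order_sort_columns(["user_id", "country"], cards)))
```

In practice the sample would have to be large enough to be representative, and a sensible threshold depends on available memory during load, which is the open question in the thread.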