Re: [DISCUSS] For the dimension default should be no dictionary

Posted by Jacky Li on Mar 01, 2017; 12:18am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8120.html


> On Feb 28, 2017, at 8:35 PM, Liang Chen <[hidden email]> wrote:
>
> Hi
>
> A couple of questions:
>
> 1) For the SORT_KEY option: are the MDK index, inverted index, and minmax
> index built only for the columns specified in the option (SORT_KEY)?
>
Yes, the MDK index, inverted index, and minmax index are built for the columns in SORT_KEY.

> 2) If users don't specify TABLE_DICTIONARY, then no column uses dictionary
> encoding, and all shuffle operations are based on the actual values. Is
> my understanding right?
> -------------------------------------------------------------------------------------------------------
> If the user does not specify this option, all columns are encoded without
> global dictionary support. A normal shuffle on decoded values will be
> applied when doing a group-by operation.
>
Yes
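
To make the difference concrete, here is a toy Python sketch (not CarbonData code; the function names and sample data are made up for illustration). It groups once on the original string values, as would happen without TABLE_DICTIONARY, and once on integer surrogate keys from a hypothetical global dictionary, decoding only at the end. Both paths should give the same result; the dictionary path just shuffles and compares small integers instead of strings.

```python
from collections import defaultdict


def build_dictionary(values):
    """Map each distinct value to an integer surrogate key (hypothetical global dictionary)."""
    return {value: key for key, value in enumerate(sorted(set(values)))}


def group_sum(keys, amounts):
    """Sum amounts per grouping key."""
    totals = defaultdict(int)
    for key, amount in zip(keys, amounts):
        totals[key] += amount
    return dict(totals)


cities = ["shenzhen", "beijing", "shenzhen", "beijing", "shanghai"]
amounts = [10, 20, 30, 40, 50]

# No dictionary: group directly on the original string values.
by_value = group_sum(cities, amounts)

# With a dictionary: group on integer keys, decode only the final result.
dictionary = build_dictionary(cities)
reverse = {key: value for value, key in dictionary.items()}
by_key = group_sum([dictionary[city] for city in cities], amounts)
decoded = {reverse[key]: total for key, total in by_key.items()}
```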

> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY. How
> does the system handle this case?
> -----------------------------------------------------------------------------------------------------------
> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK, are
> encoded with an inverted index, and have a minmax index
>
In that case C2 is sorted using its original value.
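
As a rough illustration only (a toy Python sketch, not the actual CarbonData implementation; `sort_and_index`, `prune`, and the blocklet size are assumptions made up for this example), sorting on original values and keeping a per-blocklet minmax index could look like this:

```python
def sort_and_index(rows, sort_cols, blocklet_size):
    """Sort rows by the original values of sort_cols, split them into
    blocklets, and record (min, max) per sort column for each blocklet."""
    ordered = sorted(rows, key=lambda row: tuple(row[i] for i in sort_cols))
    blocklets = []
    for start in range(0, len(ordered), blocklet_size):
        chunk = ordered[start:start + blocklet_size]
        minmax = {
            i: (min(row[i] for row in chunk), max(row[i] for row in chunk))
            for i in sort_cols
        }
        blocklets.append((minmax, chunk))
    return blocklets


def prune(blocklets, col, value):
    """Keep only blocklets whose min/max range for col can contain value."""
    return [chunk for minmax, chunk in blocklets
            if minmax[col][0] <= value <= minmax[col][1]]


# Columns are (C1, C2, C3); C1 and C2 play the role of SORT_COLUMNS.
rows = [("b", "x", 3), ("a", "z", 1), ("a", "y", 2), ("c", "w", 4)]
blocklets = sort_and_index(rows, sort_cols=[0, 1], blocklet_size=2)
candidates = prune(blocklets, col=0, value="a")
```

Here the comparison is done on the raw strings, with no dictionary key involved, and an equality filter on C1 only has to scan the blocklets whose min/max range covers the value.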

> Regards
> Liang
>
> 2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:
>
>> Yes, first we should simplify the DDL options. I propose the following
>> options; please check whether they miss any scenario.
>>
>> 1. SORT_COLUMNS, or SORT_KEY
>> This indicates three things:
>> 1) All columns specified in the option will be used to construct the
>> Multi-Dimensional Key (MDK), and the data will be sorted along this key
>> 2) They will be encoded with an inverted index and thus sorted again
>> within the column chunk in one blocklet
>> 3) A minmax index will also be created for these columns
>>
>> When to use: This option is designed to accelerate filter queries, so put
>> all filter columns into this option. The column order can be either:
>> 1) From low cardinality to high cardinality. This gives the best
>> compression and fits scenarios without frequent filters on
>> high-cardinality columns
>> 2) High-cardinality column first, then the others. This fits scenarios
>> with frequent filters on a high-cardinality column
>>
>> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK, are
>> encoded with an inverted index, and have a minmax index.
>> Note that C1, C2, C3 can be dimensions, but they can also be measures. So
>> if a user needs to filter on a measure column, it can be put in the
>> SORT_COLUMNS option.
>>
>> If the user does not specify this option, CarbonData picks the MDK as it does now.
>>
>> 2. TABLE_DICTIONARY
>> This specifies the table-level dictionary columns. A global dictionary
>> will be created for all columns in this option on every data load.
>>
>> When to use: This option is designed to accelerate aggregate queries, so
>> put group-by columns into this option.
>>
>> For example, TABLE_DICTIONARY="C2,C3,C5"
>>
>> If the user does not specify this option, all columns are encoded without
>> global dictionary support. A normal shuffle on decoded values will be
>> applied when doing a group-by operation.
>>
>> I think these two options should be the basic options for normal users;
>> their goal is to satisfy most scenarios without deep tuning of the table.
>> For advanced users who want to do deep tuning, we can discuss adding more
>> options. But first we need to identify which scenarios these two options
>> do not satisfy.
>>
>> Regards,
>> Jacky
>>
>
>
> --
> Regards
> Liang