
Re: [DISCUSS] For the dimension default should be no dictionary

Posted by Jacky Li on Mar 01, 2017; 12:30am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8121.html

Yes, I agree with your point. The only concern I have is about loading: I have seen many users accidentally put a high cardinality column into the dictionary columns, and then the load failed with out-of-memory errors or became very slow. I guess they just do not know to use DICTIONARY_EXCLUDE for these columns, or they do not have an easy way to identify the high cardinality columns. I feel preventing such misuse is important in order to encourage more users to use CarbonData.

Any suggestions on solving this issue?
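
One possible aid is to let users estimate column cardinality before deciding the dictionary columns. A minimal sketch, assuming Spark SQL's approx_count_distinct is available and using made-up table and column names:

  -- estimate distinct counts for candidate dictionary columns
  SELECT
    approx_count_distinct(c1) AS c1_cardinality,
    approx_count_distinct(c2) AS c2_cardinality,
    approx_count_distinct(c3) AS c3_cardinality
  FROM staging_events;

Columns that come back with millions of distinct values are the ones to keep out of the dictionary (today via DICTIONARY_EXCLUDE).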


Regards,
Likun


> On Feb 28, 2017, at 10:20 PM, Ravindra Pesala <[hidden email]> wrote:
>
> Hi Likun,
>
> You mentioned that if the user does not specify dictionary columns, then by
> default those columns are treated as no-dictionary columns.
> But there are many disadvantages, as I mentioned in the above mail, if we keep
> no dictionary as the default. We initially introduced no-dictionary columns
> to handle high cardinality dimensions, but making everything a no-dictionary
> column by default loses our unique feature compared to Parquet.
> Dictionary columns were introduced not only for aggregation queries; they also
> give better compression and better filter queries. Without the dictionary,
> the store size will increase a lot.
>
> Regards,
> Ravindra.
>
> On 28 February 2017 at 18:05, Liang Chen <[hidden email]> wrote:
>
>> Hi
>>
>> A couple of questions:
>>
>> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>> index" built only for the columns specified in the SORT_KEY option?
>>
>> 2) If users don't specify TABLE_DICTIONARY, then no column gets dictionary
>> encoding and all shuffle operations are based on the actual values; is
>> my understanding right?
>> ---------------------------------------------------------------------------------------------------------
>> If this option is not specified by the user, it means all columns are
>> encoded without global dictionary support. A normal shuffle on the decoded
>> values will be applied when doing a group-by operation.
>>
>> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>> suppose “C2” is specified in SORT_KEY but not in TABLE_DICTIONARY; how does
>> the system handle this case?
>> ---------------------------------------------------------------------------------------------------------
>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and are
>> encoded as Inverted Index and with Minmax Index
>>
>> Regards
>> Liang
>>
>> 2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:
>>
>>> Yes, first we should simplify the DDL options. I propose the following
>>> options; please check whether they miss any scenario.
>>>
>>> 1. SORT_COLUMNS, or SORT_KEY
>>> This indicates three things:
>>> 1) All columns specified in this option will be used to construct the
>>> Multi-Dimensional Key (MDK), and the data will be sorted along this key
>>> 2) They will be encoded as Inverted Index and thus sorted again within each
>>> column chunk in a blocklet
>>> 3) A Minmax Index will also be created for these columns
>>>
>>> When to use: This option is designed for accelerating filter queries, so put
>>> all filter columns into this option. The order of the columns can be:
>>> 1) From low cardinality to high cardinality; this gives the most compression
>>> and fits scenarios that do not filter frequently on high cardinality columns
>>> 2) Put the high cardinality columns first, then the others; this fits
>>> frequent filtering on high cardinality columns
>>>
>>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and are
>>> encoded as Inverted Index and with Minmax Index.
>>> Note that while C1, C2, C3 can be dimensions, they can also be measures. So
>>> if a user needs to filter on a measure column, it can be put in the
>>> SORT_COLUMNS option.
>>>
>>> If this option is not specified by the user, carbon will pick the MDK as it
>>> does now.
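
A minimal sketch of how the proposed option might look, assuming it is expressed through TBLPROPERTIES like the current DDL; the table and column names are placeholders, with country/city low cardinality and user_id high cardinality:

  CREATE TABLE sales_by_filter (
    country STRING,
    city STRING,
    user_id STRING,
    amount DOUBLE
  )
  STORED BY 'carbondata'
  -- ordering strategy 1: low cardinality to high cardinality
  TBLPROPERTIES ('SORT_COLUMNS'='country,city,user_id');

For ordering strategy 2 (frequent filters on the high cardinality column), the same property would simply list that column first, e.g. 'SORT_COLUMNS'='user_id,country,city'.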
>>>
>>> 2. TABLE_DICTIONARY
>>> This specifies the table-level dictionary columns. A global dictionary will
>>> be created for all columns in this option on every data load.
>>>
>>> When to use: This option is designed for accelerating aggregate queries, so
>>> put the group-by columns into this option.
>>>
>>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>>>
>>> If this option is not specified by the user, it means all columns are
>>> encoded without global dictionary support. A normal shuffle on the decoded
>>> values will be applied when doing a group-by operation.
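
Putting the two proposed options together, a full table definition could be sketched as follows; TABLE_DICTIONARY is the property name proposed in this thread, not an existing one, and the column names and types are placeholders matching the examples above:

  CREATE TABLE t1 (
    c1 STRING,
    c2 STRING,
    c3 STRING,
    c4 STRING,
    c5 STRING,
    sales INT
  )
  STORED BY 'carbondata'
  -- C1,C2,C3: filter columns (MDK + Inverted Index + Minmax Index)
  -- C2,C3,C5: group-by columns that get a global dictionary on each load
  TBLPROPERTIES (
    'SORT_COLUMNS'='C1,C2,C3',
    'TABLE_DICTIONARY'='C2,C3,C5'
  );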
>>>
>>> I think these two options should be the basic options for normal users; the
>>> goal is to satisfy most scenarios without deep tuning of the table.
>>> For advanced users who want to do deep tuning, we can debate adding more
>>> options. But first we need to identify which scenarios are not satisfied by
>>> these two options.
>>>
>>> Regards,
>>> Jacky
>>>
>>>
>>>
>>
>>
>>
>> --
>> Regards
>> Liang
>>
>
>
> --
> Thanks & Regards,
> Ravi