Apache CarbonData Dev Mailing List archive - Re: [DISCUSS] For the dimension default should be no dictionary

Posted by Jacky Li on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8214.html

Hi Bill,

1. I think Ravindra and Vishal's point is valid: we should keep dictionary encoding as the default until we have improved the performance of no-dictionary columns.
We are discussing this in another thread on the mailing list.

2. For sorting, the default should be Carbon's current behavior (automatically picking dimensions as the MDK according to the default rule). If the user specifies SORT_COLUMNS, then use it. I think SORT_EXCLUDE is not required.

Regards,
Jacky
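
The sorting rule in point 2 above can be sketched as follows. This is a minimal illustration, not CarbonData code: the function `resolve_sort_columns`, the `(name, is_dimension)` schema tuples, and the options dictionary are all hypothetical, and the "default rule" is assumed here to mean "every dimension, in schema order", which may differ from what Carbon actually does.

```python
from typing import Dict, List, Tuple

# Hypothetical schema entry: (column name, is_dimension).
Column = Tuple[str, bool]


def resolve_sort_columns(schema: List[Column], options: Dict[str, str]) -> List[str]:
    """Return the columns that form the sort key (MDK) for a table."""
    explicit = options.get("SORT_COLUMNS")
    if explicit is not None:
        # A user-specified list wins, and the user's order is kept.
        requested = [name.strip() for name in explicit.split(",") if name.strip()]
        known = {name for name, _ in schema}
        unknown = [name for name in requested if name not in known]
        if unknown:
            raise ValueError(f"SORT_COLUMNS names unknown columns: {unknown}")
        return requested
    # Assumed default rule: every dimension, in schema order.
    return [name for name, is_dimension in schema if is_dimension]


schema = [("c1", True), ("c2", True), ("m1", False)]
print(resolve_sort_columns(schema, {}))                          # ['c1', 'c2']
print(resolve_sort_columns(schema, {"SORT_COLUMNS": "m1, c1"}))  # ['m1', 'c1']
```

With only these two branches there is nothing left for a SORT_EXCLUDE option to express, and a measure such as `m1` can still be sorted on by listing it explicitly.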

> On 3 March 2017, at 12:22 AM, bill.zhou <[hidden email]> wrote:
>
> Hi all,
> To summarize this discussion:
> 1. To keep CarbonData compatible with older versions, keep
> DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE; the default is no dictionary. I do
> not suggest changing these properties to TABLE_DICTIONARY.
> 2. I suggest keeping the sort column properties in the same style as the
> dictionary ones, so the new properties would be SORT_INCLUDE and
> SORT_EXCLUDE; the default is no sort.
>
> Regards
> Bill
>
>
> ravipesala wrote
>> Hi All,
>>
>> In order to make no-dictionary columns the default, we should improve the
>> storage and performance for these columns. I have sent another mail to
>> discuss the improvement points. Please comment on it.
>>
>> Regards,
>> Ravindra
>>
>> On 1 March 2017 at 10:12, Ravindra Pesala <ravi.pesala@> wrote:
>>
>>> Hi Likun,
>>>
>>> It would be the same case if we made all columns no-dictionary by default:
>>> it would increase the store size and decrease performance, and poor
>>> performance would not encourage more users either.
>>>
>>> If we need to make no-dictionary columns the default, then we should first
>>> focus on reducing the store size and improving filter queries on
>>> non-dictionary columns. Even memory usage is higher while querying
>>> non-dictionary columns.
>>>
>>> Regards,
>>> Ravindra.
>>>
>>> On 1 March 2017 at 06:00, Jacky Li <jacky.likun@> wrote:
>>>
>>>> Yes, I agree with your point. My only concern is loading: I have seen
>>>> many users accidentally put a high cardinality column into the
>>>> dictionary columns, and then the load failed with out of memory or ran
>>>> very slowly. I guess they just do not know to use DICTIONARY_EXCLUDE for
>>>> these columns, or they do not have an easy way to identify the high
>>>> cardinality columns. I feel preventing such misuse is important in order
>>>> to encourage more users to use CarbonData.
>>>>
>>>> Any suggestions on solving this issue?
>>>>
>>>>
>>>> Regards,
>>>> Likun
>>>>
>>>>
>>>>> On 28 February 2017, at 10:20 PM, Ravindra Pesala <ravi.pesala@> wrote:
>>>>>
>>>>> Hi Likun,
>>>>>
>>>>> You mentioned that if the user does not specify dictionary columns, then
>>>>> by default those are chosen as no-dictionary columns.
>>>>> But keeping no dictionary as the default has many disadvantages, as I
>>>>> mentioned in the mail above. We initially introduced no-dictionary
>>>>> columns to handle high cardinality dimensions, but making everything
>>>>> no-dictionary by default loses our unique feature compared to Parquet.
>>>>> Dictionary columns were introduced not only for aggregation queries but
>>>>> also for better compression and better filter queries. Without a
>>>>> dictionary, the store size will increase a lot.
>>>>>
>>>>> Regards,
>>>>> Ravindra.
>>>>>
>>>>> On 28 February 2017 at 18:05, Liang Chen <chenliang6136@> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> A couple of questions:
>>>>>>
>>>>>> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>>>>>> index" built only for the columns specified in the option (SORT_KEY)?
>>>>>>
>>>>>> 2) If users don't specify TABLE_DICTIONARY, then no columns use
>>>>>> dictionary encoding, and all shuffle operations are based on the fact
>>>>>> values. Is my understanding right?
>>>>>> ------------------------------------------------------------
>>>>>> If this option is not specified by the user, all columns are encoded
>>>>>> without global dictionary support. A normal shuffle on decoded values
>>>>>> will be applied when doing a group-by operation.
>>>>>>
>>>>>> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>>>>> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY. How
>>>>>> does the system handle this case?
>>>>>> ------------------------------------------------------------
>>>>>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
>>>>>> are encoded as Inverted Index with a Minmax Index
>>>>>>
>>>>>> Regards
>>>>>> Liang
>>>>>>
>>>>>> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.likun@>:
>>>>>>
>>>>>>> Yes, first we should simplify the DDL options. I propose the following
>>>>>>> options; please check whether they miss any scenario.
>>>>>>>
>>>>>>> 1. SORT_COLUMNS, or SORT_KEY
>>>>>>> This indicates three things:
>>>>>>> 1) All columns specified in the option will be used to construct the
>>>>>>> Multi-Dimensional Key, and the data will be sorted along this key
>>>>>>> 2) They will be encoded as Inverted Index and thus sorted again within
>>>>>>> the column chunk in one blocklet
>>>>>>> 3) A Minmax index will also be created for these columns
>>>>>>>
>>>>>>> When to use: This option is designed for accelerating filter queries,
>>>>>>> so put all filter columns into this option. The order can be:
>>>>>>> 1) From low cardinality to high cardinality. This gives the most
>>>>>>> compression and fits scenarios that do not filter frequently on a
>>>>>>> high cardinality column
>>>>>>> 2) Put the high cardinality column first, then the others. This fits
>>>>>>> frequent filtering on a high cardinality column
>>>>>>>
>>>>>>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
>>>>>>> are encoded as Inverted Index with a Minmax Index.
>>>>>>> Note that C1, C2, C3 can be dimensions, but they can also be measures.
>>>>>>> So if the user needs to filter on a measure column, it can be put in
>>>>>>> the SORT_COLUMNS option.
>>>>>>>
>>>>>>> If this option is not specified by the user, Carbon will pick the MDK
>>>>>>> as it does now.
>>>>>>>
>>>>>>> 2. TABLE_DICTIONARY
>>>>>>> This specifies the table-level dictionary columns. A global dictionary
>>>>>>> will be created for all columns in this option on every data load.
>>>>>>>
>>>>>>> When to use: This option is designed for accelerating aggregate
>>>>>>> queries, so put group-by columns into this option.
>>>>>>>
>>>>>>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>>>>>>>
>>>>>>> If this option is not specified by the user, all columns are encoded
>>>>>>> without global dictionary support. A normal shuffle on decoded values
>>>>>>> will be applied when doing a group-by operation.
>>>>>>>
>>>>>>> I think these two options should be the basic options for normal
>>>>>>> users; their goal is to satisfy most scenarios without deep tuning of
>>>>>>> the table.
>>>>>>> For advanced users who want to do deep tuning, we can debate adding
>>>>>>> more options. But first we need to identify which scenarios are not
>>>>>>> satisfied by these two options.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jacky
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards
>>>>>> Liang
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Ravi
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Ravi
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Ravi
>