http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8214.html
1. I think Ravindra and Vishal's point is valid: we should keep dictionary encoding as the default until we have improved the performance of no-dictionary columns.
2. For sorting, the default should be carbon's current behavior (automatically picking dimensions for the MDK according to the default rule). If the user specifies SORT_COLUMNS, then use that instead. I think SORT_EXCLUDE is not required.
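
To make the two points above concrete, here is a minimal Python sketch of how the proposed defaults could be resolved. It is only an illustration, not CarbonData code: the function name resolve_table_options is hypothetical, and "the default rule" for picking the MDK is modelled here as "all dimensions in declaration order", which is an assumption.

```python
# Hypothetical sketch of the proposed defaults; not CarbonData's implementation.

def resolve_table_options(dimensions, properties):
    """Resolve sort and dictionary columns from user-supplied table properties.

    dimensions: dimension column names in declaration order.
    properties: dict of table properties, e.g. {"SORT_COLUMNS": "c1,c2"}.
    """
    def split(key):
        # Parse a comma-separated property value into a list of column names.
        raw = properties.get(key, "")
        return [c.strip() for c in raw.split(",") if c.strip()]

    # Point 2: use SORT_COLUMNS when the user gives it; otherwise fall back to
    # the default rule (ASSUMPTION: all dimensions, in declaration order).
    sort_columns = split("SORT_COLUMNS") or list(dimensions)

    # Point 1: dictionary encoding stays the default for dimensions, except
    # for the columns listed in DICTIONARY_EXCLUDE.
    excluded = set(split("DICTIONARY_EXCLUDE"))
    dictionary_columns = [c for c in dimensions if c not in excluded]

    return {"sort_columns": sort_columns, "dictionary_columns": dictionary_columns}
```

For example, resolve_table_options(["c1", "c2", "c3"], {"DICTIONARY_EXCLUDE": "c2"}) keeps all three columns as sort columns and leaves c2 out of the dictionary columns. DICTIONARY_INCLUDE is left out of the sketch to keep it short.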
> On 3 March 2017, at 12:22 AM, bill.zhou <[hidden email]> wrote:
>
> Hi all,
> Here is my summary of this discussion.
> 1. To keep CarbonData compatible with older versions, keep
> DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, with no dictionary as the default. I do not
> suggest changing these properties to table_dictionary.
> 2. I suggest keeping the sort_column properties in the same style as the
> dictionary ones, so the new properties would be SORT_INCLUDE and
> SORT_EXCLUDE, with no sort as the default.
>
> Regards
> Bill
>
>
> ravipesala wrote
>> Hi All,
>>
>> To make no-dictionary columns the default, we should first improve the
>> storage and performance for these columns. I have sent another mail to
>> discuss the improvement points. Please comment on it.
>>
>> Regards,
>> Ravindra
>>
>> On 1 March 2017 at 10:12, Ravindra Pesala <ravi.pesala@> wrote:
>>
>>> Hi Likun,
>>>
>>> The same applies if we make all columns no-dictionary by default: it
>>> will increase the store size and decrease performance, and poor
>>> performance will not encourage more users either.
>>>
>>> If we need to make no-dictionary columns the default, then we should first
>>> focus on reducing the store size and improving filter queries on
>>> no-dictionary columns. Memory usage is also higher when querying
>>> no-dictionary columns.
>>>
>>> Regards,
>>> Ravindra.
>>>
>>> On 1 March 2017 at 06:00, Jacky Li <jacky.likun@> wrote:
>>>
>>>> Yes, I agree with your point. My only concern is loading. I
>>>> have seen many users accidentally put a high cardinality column into
>>>> the dictionary columns, and then the load failed because it ran out of
>>>> memory or was very slow. I guess they just do not know to use
>>>> DICTIONARY_EXCLUDE for these columns, or they do not have an easy way to
>>>> identify the high cardinality columns. I feel preventing such misuse is
>>>> important in order to encourage more users to use CarbonData.
>>>>
>>>> Any suggestion on solving this issue?
>>>>
>>>>
>>>> Regards,
>>>> Likun
>>>>
>>>>
>>>>> On 28 February 2017, at 10:20 PM, Ravindra Pesala <ravi.pesala@> wrote:
>>>>>
>>>>> Hi Likun,
>>>>>
>>>>> You mentioned that if the user does not specify dictionary columns, then by
>>>>> default all columns are treated as no-dictionary columns.
>>>>> But keeping no dictionary as the default has many disadvantages, as I
>>>>> mentioned in the mail above. We initially introduced no-dictionary columns
>>>>> to handle high cardinality dimensions, but making everything
>>>>> no-dictionary by default loses our unique feature compared to Parquet.
>>>>> Dictionary columns were introduced not only for aggregation queries but
>>>>> also for better compression and better filter queries. Without
>>>>> dictionary, the store size will increase a lot.
>>>>>
>>>>> Regards,
>>>>> Ravindra.
>>>>>
>>>>> On 28 February 2017 at 18:05, Liang Chen <chenliang6136@> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> A couple of questions:
>>>>>>
>>>>>> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>>>>>> index" built only for the columns specified in the option (SORT_KEY)?
>>>>>>
>>>>>> 2) If users don't specify TABLE_DICTIONARY, then no columns get
>>>>>> dictionary encoding, and all shuffle operations are based on the fact
>>>>>> value. Is my understanding right?
>>>>>> ------------------------------------------------------------
>>>>>> If this option is not specified by the user, all columns are encoded
>>>>>> without global dictionary support. A normal shuffle on decoded values
>>>>>> will be applied when doing a group by operation.
>>>>>>
>>>>>> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>>>>> suppose "C2" is specified in SORT_KEY but not in
>>>>>> TABLE_DICTIONARY. How does the system handle this case?
>>>>>> ------------------------------------------------------------
>>>>>> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and
>>>>>> are encoded as Inverted Index with Minmax Index.
>>>>>>
>>>>>> Regards
>>>>>> Liang
>>>>>>
>>>>>> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.likun@>:
>>>>>>
>>>>>>> Yes, first we should simplify the DDL options. I propose the following
>>>>>>> options; please check whether they miss any scenario.
>>>>>>>
>>>>>>> 1. SORT_COLUMNS, or SORT_KEY
>>>>>>> This indicates three things:
>>>>>>> 1) All columns specified in the option will be used to construct the
>>>>>>> Multi-Dimensional Key (MDK), and the data will be sorted along this key
>>>>>>> 2) They will be encoded as Inverted Index and thus sorted again within
>>>>>>> the column chunk in one blocklet
>>>>>>> 3) A Minmax index will also be created for these columns
>>>>>>>
>>>>>>> When to use: This option is designed for accelerating filter queries, so
>>>>>>> put all filter columns into this option. The order can be either:
>>>>>>> 1) From low cardinality to high cardinality. This gives the most
>>>>>>> compression and fits scenarios without frequent filters on high
>>>>>>> cardinality columns
>>>>>>> 2) High cardinality columns first, then the others. This fits
>>>>>>> frequent filters on high cardinality columns
>>>>>>>
>>>>>>> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK and
>>>>>>> are encoded as Inverted Index with Minmax Index.
>>>>>>> Note that C1, C2, C3 can be dimensions, but they can also be measures. So
>>>>>>> if the user needs to filter on a measure column, it can be put in the
>>>>>>> SORT_COLUMNS option.
>>>>>>>
>>>>>>> If this option is not specified by the user, carbon will pick the MDK as
>>>>>>> it does now.
>>>>>>>
>>>>>>> 2. TABLE_DICTIONARY
>>>>>>> This specifies the table-level dictionary columns. A global
>>>>>>> dictionary will be created for all columns in this option on every data load.
>>>>>>>
>>>>>>> When to use: This option is designed for accelerating aggregate queries,
>>>>>>> so put group by columns into this option.
>>>>>>>
>>>>>>> For example, TABLE_DICTIONARY="C2,C3,C5"
>>>>>>>
>>>>>>> If this option is not specified by the user, all columns are encoded
>>>>>>> without global dictionary support. A normal shuffle on decoded values
>>>>>>> will be applied when doing a group by operation.
>>>>>>>
>>>>>>> I think these two options should be the basic options for normal users.
>>>>>>> Their goal is to satisfy most scenarios without deep tuning of the table.
>>>>>>> For advanced users who want to do deep tuning, we can debate adding more
>>>>>>> options. But first we need to identify which scenarios are not satisfied
>>>>>>> by these two options.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Jacky
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
>>>>>>> Sent from the Apache CarbonData Mailing List archive mailing list archive
>>>>>>> at Nabble.com.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards
>>>>>> Liang
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Ravi
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Ravi
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Ravi
>
>
>
>