Apache CarbonData Dev Mailing List archive - Re: [DISCUSS] For the dimension default should be no dictionary

Posted by Liang Chen on Feb 28, 2017; 12:35pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8096.html

Hi

A couple of questions:

1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
index" built only for the columns specified in the option (SORT_KEY)?

2) If users don't specify TABLE_DICTIONARY, then no column gets dictionary
encoding, and all shuffle operations are based on the actual values. Is
my understanding right?
-------------------------------------------------------------------------------------------------------
If this option is not specified by the user, all columns are encoded without
global dictionary support. A normal shuffle on decoded values will be applied
when doing a group-by operation.
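
To make the difference concrete, here is a minimal Python sketch of grouping on decoded values versus grouping on dictionary keys. It is not CarbonData code; the helper names and the sample rows are made up for illustration, and it only assumes that a global dictionary maps each distinct value to a small integer.

```python
from collections import defaultdict

def build_dictionary(values):
    """Map each distinct value to a small integer key (hypothetical helper)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def group_sum_decoded(rows):
    """No dictionary: group (and shuffle) directly on the decoded values."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)

def group_sum_encoded(rows, dictionary):
    """With a dictionary: group on integer keys, decode only the final result."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[dictionary[key]] += amount
    reverse = {i: v for v, i in dictionary.items()}
    return {reverse[i]: total for i, total in totals.items()}

# Illustrative data only: (group-by column, measure)
rows = [("shenzhen", 10), ("beijing", 5), ("shenzhen", 7)]
dictionary = build_dictionary(k for k, _ in rows)
```

Both functions return the same totals; the encoded variant just moves integers around during the grouping step instead of full strings.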

3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
How does the system handle this case?
-----------------------------------------------------------------------------------------------------------
For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK, are
encoded with an Inverted Index, and get a Minmax Index.
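
One possible reading of this, sketched in Python: rows are sorted on the tuple of sort columns, split into blocklets, and a min/max pair is kept per blocklet so a filter can skip blocklets. This is only an illustration under my own assumptions (the function names, the blocklet size, and sorting on raw values when a column has no dictionary are all hypothetical), not the actual CarbonData implementation.

```python
def sort_by_mdk(rows, sort_columns):
    """Sort rows on the multi-dimensional key built from the sort columns."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))

def build_minmax(rows, column, blocklet_size):
    """Keep one (min, max) pair per blocklet for the given column."""
    index = []
    for start in range(0, len(rows), blocklet_size):
        chunk = [r[column] for r in rows[start:start + blocklet_size]]
        index.append((min(chunk), max(chunk)))
    return index

def blocklets_to_scan(index, value):
    """Return the blocklets whose [min, max] range can contain the value."""
    return [i for i, (lo, hi) in enumerate(index) if lo <= value <= hi]

# Illustrative data only; C2 is compared as a raw string (no dictionary).
rows = [
    {"C1": "b", "C2": "y", "C3": 3},
    {"C1": "a", "C2": "z", "C3": 1},
    {"C1": "c", "C2": "x", "C3": 2},
    {"C1": "a", "C2": "x", "C3": 4},
]
sorted_rows = sort_by_mdk(rows, ["C1", "C2", "C3"])
c1_index = build_minmax(sorted_rows, "C1", blocklet_size=2)
```

With this data a filter on C1 = "a" only needs the first blocklet, because the sort groups equal C1 values together.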

Regards
Liang

2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:

> Yes, first we should simplify the DDL options. I propose the following
> options; please check whether they miss any scenario.
>
> 1. SORT_COLUMNS, or SORT_KEY
> This indicates three things:
> 1) All columns specified in the option will be used to construct the
> Multi-Dimensional Key (MDK), and the data will be sorted along this key
> 2) They will be encoded with an Inverted Index and thus sorted again within
> the column chunk in one blocklet
> 3) A Minmax index will also be created for these columns
>
> When to use: This option is designed for accelerating filter queries, so put
> all filter columns into this option. The column order can be:
> 1) From low cardinality to high cardinality. This gives the best compression
> and fits scenarios that do not filter frequently on high-cardinality columns
> 2) Put the high-cardinality column first, then the others. This fits
> frequent filtering on a high-cardinality column
>
> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK, are
> encoded with an Inverted Index, and get a Minmax Index.
> Note that C1, C2, C3 can be dimensions, but they can also be measures. So
> if the user needs to filter on a measure column, it can be put in the
> SORT_COLUMNS option.
>
> If this option is not specified by the user, carbon will pick the MDK as it does now.
>
> 2. TABLE_DICTIONARY
> This specifies the table-level dictionary columns. A global dictionary
> will be created for all columns in this option on every data load.
>
> When to use: This option is designed for accelerating aggregate queries, so
> put the group-by columns into this option.
>
> For example, TABLE_DICTIONARY="C2,C3,C5"
>
> If this option is not specified by the user, all columns are encoded without
> global dictionary support. A normal shuffle on decoded values will be applied
> when doing a group-by operation.
>
> I think these two options should be the basic options for normal users. The
> goal is to satisfy most scenarios without deep tuning of the table.
> For advanced users who want to do deep tuning, we can debate adding more
> options. But first we need to identify which scenarios are not satisfied by
> these two options.
>
> Regards,
> Jacky
