http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8127.html
does not encourage more users if performance is poor.
non-dictionary columns.
Ravindra.
> Yes, I agree with your point. My only concern is loading: I have seen
> many users accidentally declare a high-cardinality column as a dictionary
> column, and then the load either fails with an out-of-memory error or runs
> very slowly. I guess they either do not know to use DICTIONARY_EXCLUDE for
> these columns or do not have an easy way to identify the high-cardinality
> columns. I think preventing such misuse is important if we want to
> encourage more users to use CarbonData.
>
> Any suggestions on solving this issue?
>
>
> Regards,
> Likun
>
>
> > On 28 February 2017, at 10:20 PM, Ravindra Pesala <[hidden email]> wrote:
> >
> > Hi Likun,
> >
> > You mentioned that if the user does not specify dictionary columns, then
> > by default those columns are chosen as no-dictionary columns.
> > But keeping no dictionary as the default has many disadvantages, as I
> > mentioned in my mail above. We initially introduced no-dictionary columns
> > to handle high-cardinality dimensions, but making everything a
> > no-dictionary column by default loses our unique feature compared to
> > Parquet. Dictionary columns were introduced not only for aggregation
> > queries but also for better compression and better filter queries.
> > Without dictionaries, the store size will increase a lot.
> >
> > Regards,
> > Ravindra.
> >
> > On 28 February 2017 at 18:05, Liang Chen <[hidden email]> wrote:
> >
> >> Hi
> >>
> >> A couple of questions:
> >>
> >> 1) For the SORT_KEY option: are the MDK index, inverted index, and minmax
> >> index built only for the columns specified in the option (SORT_KEY)?
> >>
> >> 2) If users don't specify TABLE_DICTIONARY, then no column is dictionary
> >> encoded and all shuffle operations are based on fact values. Is my
> >> understanding right?
> >> ------------------------------------------------------------
> >> If this option is not specified by the user, all columns are encoded
> >> without global dictionary support. A normal shuffle on decoded values
> >> will be applied when doing a group-by operation.
> >>
> >> 3) After introducing the two options SORT_KEY and TABLE_DICTIONARY,
> >> suppose C2 is specified in SORT_KEY but not in TABLE_DICTIONARY. How
> >> does the system handle this case?
> >> ------------------------------------------------------------
> >> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, and C3 form the MDK
> >> and are encoded as Inverted Index, with a Minmax Index.
> >>
> >> Regards
> >> Liang
> >>
> >> 2017-02-28 19:35 GMT+08:00 Jacky Li <[hidden email]>:
> >>
> >>> Yes, first we should simplify the DDL options. I propose the following
> >>> options; please check whether they miss any scenario.
> >>>
> >>> 1. SORT_COLUMNS, or SORT_KEY
> >>> This indicates three things:
> >>> 1) All columns specified in options will be used to construct
> >>> Multi-Dimensional Key, which will be sorted along this key
> >>> 2) They will be encoded as Inverted Index and thus again sorted within
> >>> column chunk in one blocklet
> >>> 3) Minmax index will also be created for these columns
> >>>
> >>> When to use: This option is designed for accelerating filter queries, so
> >>> put all filter columns into this option. The column order can be:
> >>> 1) From low cardinality to high cardinality. This gives the best
> >>> compression and fits scenarios that do not filter frequently on
> >>> high-cardinality columns.
> >>> 2) Put the high-cardinality column first, then the others. This fits
> >>> scenarios that filter frequently on a high-cardinality column.
> >>>
> >>> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, and C3 form the MDK
> >>> and are encoded as Inverted Index, with a Minmax Index.
> >>> Note that C1, C2, and C3 can be dimensions, but they can also be
> >>> measures. So if the user needs to filter on a measure column, it can be
> >>> put in the SORT_COLUMNS option.
> >>>
> >>> If this option is not specified by the user, carbon will pick the MDK
> >>> as it does now.
> >>>
> >>> 2. TABLE_DICTIONARY
> >>> This specifies the table-level dictionary columns. A global dictionary
> >>> will be created for all columns in this option on every data load.
> >>>
> >>> When to use: This option is designed for accelerating aggregate
> >>> queries, so put group-by columns into this option.
> >>>
> >>> For example, TABLE_DICTIONARY="C2,C3,C5"
> >>>
> >>> If this option is not specified by the user, all columns are encoded
> >>> without global dictionary support. A normal shuffle on decoded values
> >>> will be applied when doing a group-by operation.
> >>>
> >>> I think these two options should be the basic options for normal users;
> >>> the goal is to satisfy most scenarios without deep tuning of the table.
> >>> For advanced users who want to do deep tuning, we can debate adding more
> >>> options. But first we need to identify which scenarios are not
> >>> satisfied by these two options.
> >>>
> >>> Regards,
> >>> Jacky
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>
>
>
>
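One possible way to approach the concern raised in this thread (users have no easy way to identify high-cardinality columns before choosing dictionary columns) is to profile a sample of the data first. The following is only a minimal sketch, not part of CarbonData: the function names, the sample rows, and the threshold value are all hypothetical, and a real table would need a sampling or approximate-count strategy rather than exact in-memory sets.

```python
# Hypothetical cutoff: columns with more distinct values than this are
# treated as high cardinality. The number is an assumption for illustration.
HIGH_CARDINALITY_THRESHOLD = 1_000_000


def column_cardinalities(rows, columns):
    """Count distinct values per column over an iterable of row tuples."""
    seen = {name: set() for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            seen[name].add(value)
    return {name: len(values) for name, values in seen.items()}


def suggest_options(cardinalities, threshold=HIGH_CARDINALITY_THRESHOLD):
    """Split columns into dictionary candidates and exclusion candidates.

    Also returns the columns ordered from low to high cardinality, which is
    the first ordering discussed in the thread for the sort columns.
    """
    # Sort by (cardinality, name) so ties are resolved deterministically.
    ordered = sorted(cardinalities, key=lambda name: (cardinalities[name], name))
    return {
        "dictionary_candidates": [c for c in ordered if cardinalities[c] <= threshold],
        "dictionary_exclude": [c for c in ordered if cardinalities[c] > threshold],
        "sort_columns_low_to_high": ordered,
    }


if __name__ == "__main__":
    # Tiny made-up sample; the threshold is lowered so the split is visible.
    sample_columns = ["country", "user_id", "flag"]
    sample_rows = [
        ("cn", "u1", 1),
        ("cn", "u2", 1),
        ("us", "u3", 2),
        ("us", "u4", 2),
    ]
    cards = column_cardinalities(sample_rows, sample_columns)
    print(suggest_options(cards, threshold=2))
```

The columns reported under dictionary_exclude would be the ones to consider for DICTIONARY_EXCLUDE; whether a given cutoff is appropriate depends on the available memory and the data, so the threshold should be treated as a tuning knob rather than a recommendation.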