http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbon-data-performance-doubts-tp18438p18659.html
Thank you, Jacky! The encoding property above makes sense. How would you handle
specify "dictionary_include" for that column.
> Hi Swapnil,
>
> Dictionary encoding benefits aggregation queries (carbon leverages the late
> decode optimization in the SQL optimizer), so you can use it for columns on
> which you frequently do group by. While it can improve query performance,
> it also requires more memory and CPU during loading. Normally, you
> should consider using dictionary encoding only on low-cardinality columns.
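
To make the late-decode idea concrete, here is a minimal sketch in Python. It is an illustration only, not CarbonData's implementation: the function names `dictionary_encode` and `group_sum_late_decode` and the sample data are hypothetical. The group by runs on small integer codes, and the strings are looked up only for the final group keys.

```python
from collections import defaultdict


def dictionary_encode(values):
    """Map each distinct value to an integer code, in first-seen order."""
    dictionary = {}  # value -> code
    codes = []
    for v in values:
        # A new value gets the next unused code; a seen value reuses its code.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    # Reverse mapping (code -> value), needed only when decoding.
    reverse = [None] * len(dictionary)
    for value, code in dictionary.items():
        reverse[code] = value
    return codes, reverse


def group_sum_late_decode(codes, reverse, measures):
    """Aggregate on integer codes, then decode only the final group keys."""
    sums = defaultdict(int)
    for code, m in zip(codes, measures):
        sums[code] += m
    return {reverse[code]: total for code, total in sums.items()}


# Hypothetical sample column and measure.
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"]
population = [10, 20, 30, 40, 50, 60]

codes, reverse = dictionary_encode(cities)
result = group_sum_late_decode(codes, reverse, population)
print(result)
```

The extra work at load time is visible here too: building `dictionary` costs memory and CPU in proportion to the number of distinct values.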
>
> In the current Apache master branch (and all releases before 1.2),
> CarbonData's default encoding strategy favors query performance over
> loading performance. By default, every string column is
> encoded as dictionary. This sometimes creates problems: for example,
> if there is a high-cardinality column in the table, loading may fail because
> the JVM runs out of memory. To avoid this, we added the DICTIONARY_EXCLUDE
> option so that the user can disable this default behavior manually. So the
> DICTIONARY_EXCLUDE property is designed for String columns only.
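
A rough way to see why high cardinality hurts: a dictionary holds one entry per distinct value, so for a nearly unique column it grows as large as the column itself. The sketch below only counts entries; the column contents and the helper `dictionary_entries` are made up for illustration, and it does not measure CarbonData's actual memory use.

```python
def dictionary_entries(values):
    """Number of entries a dictionary for this column would need."""
    return len(set(values))


rows = 100_000
# Hypothetical low-cardinality column: 50 distinct city names.
low_cardinality = [f"city_{i % 50}" for i in range(rows)]
# Hypothetical high-cardinality column: every value is unique.
high_cardinality = [f"user_{i}" for i in range(rows)]

low_entries = dictionary_entries(low_cardinality)
high_entries = dictionary_entries(high_cardinality)
print(low_entries, high_entries)
```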
>
> And if you have a low-cardinality integer column (like some ID field), you
> can choose to encode it as dictionary by specifying DICTIONARY_INCLUDE, so
> that group by on this integer column will be faster.
>
> All of this is the current behavior. There was a discussion about changing it
> to give more control to the user in the coming release (1.2).
> The proposed target behavior is:
> 1. There will be a default encoding strategy for each data type. If the user
> does not specify any encoding-related property in CREATE TABLE, carbon will
> use the default encoding strategy for each column.
> 2. There will be an ENCODING property through which the user can override
> the system default strategy. For example, the user can create a table with:
>
> CREATE TABLE t1 (city_name STRING, city_id INT, population INT, area DOUBLE)
> TBLPROPERTIES ('ENCODING' = 'city_name: dictionary, city_id: {dictionary, RLE}, population: delta')
>
> This SQL means city_name is encoded using dictionary, city_id is encoded
> using dictionary followed by RLE encoding (for numeric values), population is
> encoded using delta encoding, and area is encoded using the system default
> encoding for the double data type.
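
For readers less familiar with the two numeric encodings named in the example, here is a small Python sketch of run-length encoding (RLE) applied to dictionary codes, and of delta encoding. These are textbook versions with hypothetical sample values, not CarbonData's on-disk format.

```python
def rle_encode(values):
    """Run-length encode a sequence as [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return runs


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical dictionary codes for city_id (already dictionary-encoded).
city_id_codes = [0, 0, 0, 1, 1, 2]
# Hypothetical population values that change slowly from row to row.
population = [1000, 1010, 1025, 1025, 1100]

city_id_rle = rle_encode(city_id_codes)
population_delta = delta_encode(population)
print(city_id_rle)
print(population_delta)
```

RLE pays off when equal codes sit next to each other, and delta encoding pays off when neighbouring values are close, since the stored differences are small.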
>
> This change is still in progress (CARBONDATA-1014,
> https://issues.apache.org/jira/browse/CARBONDATA-1014) on the
> apache/encoding_override branch. Once it is done and stable, it will be
> merged into master.
>
> Please advise if you have any suggestions.
>
> Regards,
> Jacky
>
>
> > On July 21, 2017, at 12:12 AM, Swapnil Shinde <[hidden email]> wrote:
> >
> > Ok. Just curious: is there any reason not to support numeric columns with
> > dictionary_exclude? Wouldn't it be useful for a unique numeric column that
> > should be a dimension but should avoid creating a dictionary (as it may not
> > be beneficial)?
> >
> > Thanks
> > Swapnil
> >
> >
> > On Thu, Jul 20, 2017 at 4:20 AM, manishgupta88 <[hidden email]>
> > wrote:
> >
> >> No, Dictionary_Exclude is supported only for String data type columns.
> >>
> >> Regards
> >> Manish Gupta
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/carbon-data-performance-doubts-tp18438p18559.html
> >> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> >> archive at Nabble.com.
> >>
>
>