
Re: carbon data performance doubts

Posted by Liang Chen-2 on Jul 22, 2017; 1:22am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/carbon-data-performance-doubts-tp18438p18660.html

Hi Swapnil

Actually, the current system's behavior is: index and dictionary encoding
are decoupled; there is no relationship between them.

1. If you want some columns to have good filter performance, just add these
columns to sort_columns (like tblproperties('sort_columns'='empno')) to
build a good MDX index on them; for filtering on an INT column, just add the
INT column to the sort_columns list.

2. If you want some columns to have good aggregation performance for group
by, just dictionary-encode those columns. By default an INT column is not
dictionary encoded, so there is no need to add it to "DICTIONARY_EXCLUDE";
if the INT column has low cardinality and you also want good aggregation on
it, use "DICTIONARY_INCLUDE" for that column.

So, in a word: there is no DICTIONARY_EXCLUDE scenario for a high-cardinality
INT column :) A small sketch combining both points follows below.
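
For illustration, a minimal sketch of both points (the table and the deptno
and salary columns are made up for this example, assuming the usual
STORED BY 'carbondata' syntax):

CREATE TABLE employee (empno INT, deptno INT, salary DOUBLE)
STORED BY 'carbondata'
TBLPROPERTIES ('sort_columns'='empno', 'DICTIONARY_INCLUDE'='deptno')

Here empno goes into sort_columns so filters on it can use the index, and the
low-cardinality deptno is dictionary encoded so group by on it aggregates well.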

HTH.

Regards
Liang


2017-07-22 6:09 GMT+08:00 Swapnil Shinde <[hidden email]>:

> Thank you Jacky! The encoding property above makes sense. How would you
> handle an INT column with high cardinality? As per my understanding, this
> column will be considered a measure, and the only way to make it a dimension
> is to specify "dictionary_include" for that column.
> Any reason why a column being a dimension or a measure is tied to dictionary
> encoding? Does it make sense to have a column as a dimension with no
> encoding, so that indexes can be used for filtering?
>
> Thanks
> Swapnil
>
>
> On Fri, Jul 21, 2017 at 12:30 PM, Jacky Li <[hidden email]> wrote:
>
> > Hi Swapnil,
> >
> > Dictionary is beneficial for aggregation queries (carbon will leverage the
> > late decode optimization in the SQL optimizer), so you can use it for
> > columns on which you frequently do group by. While it can improve query
> > performance, it also requires more memory and CPU while loading. Normally,
> > you should consider using dictionary only on low-cardinality columns.
> >
> > In the current apache master branch (and all releases before 1.2),
> > carbon data's default encoding strategy favors query performance over
> > loading performance. By default, all string columns are encoded as
> > dictionary. But this sometimes creates problems; for example, if there are
> > high-cardinality columns in the table, loading may fail due to insufficient
> > JVM memory. To avoid this, we have added the DICTIONARY_EXCLUDE option so
> > that the user can disable this default behavior manually. So, the
> > DICTIONARY_EXCLUDE property is designed for String columns only.
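> >
> > For illustration, a sketch of that case (the table and column names here
> > are made up, not from this thread):
> >
> > CREATE TABLE tx (txn_id STRING, city_name STRING, amount DOUBLE)
> > STORED BY 'carbondata'
> > TBLPROPERTIES ('DICTIONARY_EXCLUDE'='txn_id')
> >
> > The high-cardinality txn_id then skips dictionary generation during load,
> > while city_name still gets the default dictionary encoding for strings.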
> >
> > And, if you have a low-cardinality integer column (like some ID field),
> > you can choose to encode it as dictionary by specifying DICTIONARY_INCLUDE,
> > so that group by on this integer column will be faster.
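> >
> > Again for illustration (a sketch with made-up names):
> >
> > CREATE TABLE emp (name STRING, dept_id INT, salary DOUBLE)
> > STORED BY 'carbondata'
> > TBLPROPERTIES ('DICTIONARY_INCLUDE'='dept_id')
> >
> > SELECT dept_id, SUM(salary) FROM emp GROUP BY dept_id
> >
> > The low-cardinality dept_id is dictionary encoded, so the group by above
> > can benefit from the late decode optimization.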
> >
> > All of this is the current behavior, and there has been discussion about
> > changing it to give more control to the user in the coming release (1.2).
> > The new proposed target behavior will be:
> > 1. There will be a default encoding strategy for each data type. If the
> > user does not specify any encoding-related property in CREATE TABLE,
> > carbon will use the default encoding strategy for each column.
> > 2. There will be an ENCODING property through which the user can override
> > the system default strategy. For example, the user can create a table by:
> >
> > CREATE TABLE t1 (city_name STRING, city_id INT, population INT, area DOUBLE)
> > TBLPROPERTIES ('ENCODING' = 'city_name: dictionary, city_id: {dictionary,
> > RLE}, population: delta')
> >
> > This SQL means city_name is encoded using dictionary, city_id is encoded
> > using dictionary and then RLE encoding (for the numeric values), population
> > is encoded using delta encoding, and area is encoded using the system
> > default encoding for the double data type.
> >
> > This change is still in progress (CARBONDATA-1014,
> > https://issues.apache.org/jira/browse/CARBONDATA-1014), on the
> > apache/encoding_override branch. Once it is done and stable it will be
> > merged into master.
> >
> > Please advise if you have any suggestions.
> >
> > Regards,
> > Jacky
> >
> >
> > > On Jul 21, 2017, at 12:12 AM, Swapnil Shinde <[hidden email]> wrote:
> > >
> > > Ok. Just curious - any reason not to support numeric columns with
> > > dictionary_exclude? Wouldn't it be useful for a unique numeric column
> > > that should be a dimension but should avoid creating a dictionary (as
> > > it may not be beneficial)?
> > >
> > > Thanks
> > > Swapnil
> > >
> > >
> > > On Thu, Jul 20, 2017 at 4:20 AM, manishgupta88 <[hidden email]>
> > > wrote:
> > >
> > >> No. Dictionary_Exclude is supported only for String data type columns.
> > >>
> > >> Regards
> > >> Manish Gupta
> > >>
> > >>
> > >>
> >
> >
>