http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Encoding-override-and-extensibility-tp12633p12744.html
> Hi,
>
> As I mentioned, a default strategy applies when the user does not set any
> encoding option. For example, if the user sets no encoding option for a
> high cardinality dimension column, Carbon will use the default encoding
> for that column, which is LV_BYTES_ENCODE.
>
> Regards,
> Jacky
>
> > On May 16, 2017, at 5:54 PM, Liang Chen <[hidden email]> wrote:
> >
> > Hi
> >
> > This is a great discussion on making the "encoding functions" easier to
> > use.
> >
> > Exposing all these options to users for different business cases is
> > good. But to be frank, it is difficult for general users to understand
> > all the options and configure them exactly.
> > So we need to think more about a "default option" or "default option
> > group" when designing the solution.
> >
> > For example: a high cardinality column can be set with
> > 'LV_BYTES_ENCODE'='C1', but what is the default encoding behavior if
> > users don't set any option for these columns?
> >
> > Regards
> > Liang
> >
> > 2017-05-13 23:34 GMT-04:00 Jacky Li <[hidden email]>:
> >
> >> For dictionary encoding related behavior, we had a discussion back in
> >> March:
> >> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-td8010.html
> >> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-amp-performance-td8146.html
> >>
> >> From these two mail threads, we conclude that:
> >> 1. The initial idea of non-dictionary encoding is only for high
> >> cardinality dimension columns; it should not be the default encoding for
> >> all dimension columns.
> >> 2. While there are some suggestions in the mail threads to improve the
> >> usability of the DDL, we still need to find a way to make it simpler for
> >> the user to control the encoding. So I propose a new solution here in
> >> this thread.
> >>
> >>
> >> The main goal of this proposal is to introduce a new TBLPROPERTY that
> >> makes it simpler to control the column encoding and also makes it
> >> extensible by developers.
> >> The proposal follows.
> >>
> >> 1. Encoding override
> >> I propose to introduce a set of keywords in TBLPROPERTY to control the
> >> encoding of each field in the table. The goal is to make it simpler for
> >> the user to control the encoding.
> >> One keyword represents one encoding type. Currently we have three
> >> encoding types for dimensions and two for measures:
> >>
> >> For dimension:
> >> 1) GLOBAL_DICTIONARY_ENCODE, for table level global dictionary
> >> encoding
> >> 2) LV_BYTES_ENCODE, for high cardinality string columns and complex
> >> data type columns, which are currently encoded as Length-Value encoded
> >> byte arrays
> >> 3) INVERTED_INDEX_ENCODE, for low cardinality columns
> >>
> >> For measure:
> >> 1) DELTA_ENCODE: use delta encoding
> >> 2) ADAPTIVE_ENCODE: encode value using adaptive data type.
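
To make two of the listed encodings concrete, here is a minimal Python sketch of Length-Value byte encoding and of delta encoding. It is an illustration only, not CarbonData's implementation: the 4-byte big-endian length prefix, the consecutive-difference form of delta, and all function names are assumptions made for this example.

```python
import struct


def lv_encode(values):
    """Length-Value encode strings: 4-byte big-endian length, then UTF-8 bytes.

    The 4-byte length width is an assumption for this sketch.
    """
    out = bytearray()
    for value in values:
        raw = value.encode("utf-8")
        out += struct.pack(">I", len(raw))
        out += raw
    return bytes(out)


def lv_decode(buf):
    """Inverse of lv_encode: walk the buffer one length/value pair at a time."""
    values, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return values


def delta_encode(nums):
    """Keep the first value, then store each value minus its predecessor."""
    if not nums:
        return []
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

For example, `delta_encode([100, 102, 105])` yields `[100, 2, 3]`, which is why delta encoding suits slowly changing integer measures: the stored numbers stay small.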
> >>
> >> The user can control the encoding, for example:
> >> -- suppose C1 is a high cardinality column
> >> CREATE TABLE table (C1 STRING, C2 STRING, C3 STRING, C4 STRING,
> >> C5 INT, C6 INT, C7 STRING)
> >> STORED BY 'carbondata'
> >> TBLPROPERTIES ('SORT_COLUMNS' = 'C1, C3',
> >> 'GLOBAL_DICTIONARY_ENCODE' = 'C2, C3, C4', 'LV_BYTES_ENCODE' = 'C1',
> >> 'DELTA_ENCODE' = 'C5')
> >>
> >> In this example, the MDK is C1 and C3; C2/C3/C4 are encoded as global
> >> dictionary; C1 is a high cardinality column that uses LV_BYTES
> >> (no-dictionary); C5 is encoded using delta; and the other columns (C6/C7)
> >> are encoded using the default strategy.
> >> The advantages of this approach are that users can:
> >> 1) express encoding independently of the MDK columns, a long-standing
> >> requirement from the community.
> >> 2) compare the efficiency of certain encodings by explicitly
> >> specifying different encodings for the same field in two tables. This is
> >> required when exploring new encoding methods.
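
The mapping from the proposed TBLPROPERTIES to a per-column encoding plan could be sketched as follows. This is a hypothetical Python illustration, not CarbonData code: `resolve_encodings`, the `DEFAULT` placeholder, and the fixed keyword order are assumptions made for the example.

```python
# Proposed encoding keywords from this thread, in the order they are applied
# in this sketch (the ordering itself is an assumption).
ENCODING_KEYS = [
    "GLOBAL_DICTIONARY_ENCODE",
    "LV_BYTES_ENCODE",
    "INVERTED_INDEX_ENCODE",
    "DELTA_ENCODE",
    "ADAPTIVE_ENCODE",
]


def resolve_encodings(columns, tblproperties):
    """Return {column: [encoding keywords]}; unlisted columns get ["DEFAULT"]."""
    plan = {column: [] for column in columns}
    for key in ENCODING_KEYS:
        listed = tblproperties.get(key, "")
        for name in (part.strip() for part in listed.split(",")):
            if not name:
                continue
            if name not in plan:
                raise ValueError(f"{key} refers to unknown column {name!r}")
            plan[name].append(key)
    # Columns without any override fall back to the default strategy.
    return {column: encodings or ["DEFAULT"] for column, encodings in plan.items()}


props = {
    "SORT_COLUMNS": "C1, C3",  # not an encoding keyword, so ignored here
    "GLOBAL_DICTIONARY_ENCODE": "C2, C3, C4",
    "LV_BYTES_ENCODE": "C1",
    "DELTA_ENCODE": "C5",
}
plan = resolve_encodings(["C1", "C2", "C3", "C4", "C5", "C6", "C7"], props)
```

With the properties from the example table, `plan["C1"]` is `["LV_BYTES_ENCODE"]`, `plan["C5"]` is `["DELTA_ENCODE"]`, and C6 and C7 resolve to `["DEFAULT"]`, independent of which columns appear in SORT_COLUMNS.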
> >>
> >> 2. Default strategy
> >> Using the above keywords, the user can override the encoding method for
> >> a specific column. If the user does not specify those keywords,
> >> CarbonData will choose the encoding method based on a default strategy.
> >> The default strategy is the same as the current CarbonData 1.1
> >> implementation, to ensure backward compatibility.
> >>
> >> In future, this default strategy could also be changed if a better
> >> strategy is found, for example, heuristic rules based on data
> >> distribution rather than just data type.
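
As a rough illustration of what a distribution-based default might look like for a string column, here is a hedged Python sketch. Nothing in it comes from CarbonData: the distinct-ratio heuristic, the 0.5 threshold, the fallback for an empty sample, and both function names are assumptions chosen only to show an override taking precedence over a default rule.

```python
def choose_default_encoding(sample, high_cardinality_ratio=0.5):
    """Pick a default encoding for a string column from a sample of its values.

    Hypothetical rule: if at least `high_cardinality_ratio` of the sampled
    values are distinct, treat the column as high cardinality.
    """
    if not sample:
        # Assumption: with no data to inspect, fall back to dictionary encoding.
        return "GLOBAL_DICTIONARY_ENCODE"
    distinct_ratio = len(set(sample)) / len(sample)
    if distinct_ratio >= high_cardinality_ratio:
        return "LV_BYTES_ENCODE"
    return "GLOBAL_DICTIONARY_ENCODE"


def encoding_for(column, overrides, sample):
    """A user override wins; otherwise the default strategy decides."""
    return overrides.get(column) or choose_default_encoding(sample)
```

A column of mostly unique identifiers would resolve to `LV_BYTES_ENCODE` under this rule, while a column with a handful of repeated values would resolve to `GLOBAL_DICTIONARY_ENCODE`, unless the user names the column in an explicit override.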
> >>
> >> 3. Encoding cascading
> >> Encoding TBLPROPERTY options can be cascaded. For example,
> >> TBLPROPERTIES ('GLOBAL_DICTIONARY_ENCODE'='C2, C3, C4',
> >> 'INVERTED_INDEX_ENCODE'='C3') means that C3 is first encoded as global
> >> dictionary and then encoded as inverted index using the dictionary
> >> encoding output.
> >>
> >> Using this approach, the user can control whether to build an inverted
> >> index for each column.
> >> This feature is currently mainly for inverted index; we still need to
> >> explore whether it is suitable for all encoding methods.
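
The cascade described for C3 (dictionary encoding first, then an inverted index over the dictionary output) can be sketched in Python as below. This is a textbook illustration under stated assumptions: the dictionary assigns codes in sorted order, the inverted index maps each code to its row ids, and neither function reflects CarbonData's actual on-disk layout.

```python
def dictionary_encode(values):
    """Assign each distinct value an integer code (sorted order, an assumption)."""
    dictionary = {value: code for code, value in enumerate(sorted(set(values)))}
    return dictionary, [dictionary[value] for value in values]


def inverted_index_encode(codes):
    """Map each dictionary code to the list of row ids where it occurs."""
    index = {}
    for row, code in enumerate(codes):
        index.setdefault(code, []).append(row)
    return index


# Cascade: the second stage consumes the output of the first stage.
column_c3 = ["b", "a", "b", "c"]
dictionary, codes = dictionary_encode(column_c3)
index = inverted_index_encode(codes)
```

Here `codes` is `[1, 0, 1, 2]` and `index` is `{1: [0, 2], 0: [1], 2: [3]}`; the inverted index never sees the original strings, only the dictionary codes, which is the point of cascading.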
> >>
> >> 4. Encoding extensibility
> >> Besides the currently supported encoding methods, we can make encoding
> >> extensible by developers. Developers can implement the encode/decode
> >> interface and give it a short name with the '_ENCODING' suffix. For
> >> example, TBLPROPERTIES ('BITMAP_ENCODING'='C7') encodes C7 as bitmap in
> >> the above example.
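
One possible shape for such an extension point is a registry keyed by the short name. The Python sketch below is hypothetical throughout: the `Encoder` interface, the `register` decorator, and the integer-as-bitset bitmap are assumptions for illustration, and the thread does not define the real interface.

```python
from abc import ABC, abstractmethod

REGISTRY = {}  # short name -> encoder class


class Encoder(ABC):
    """Hypothetical encode/decode interface a developer would implement."""

    @abstractmethod
    def encode(self, values):
        ...

    @abstractmethod
    def decode(self, encoded):
        ...


def register(short_name):
    """Class decorator that registers an encoder under its TBLPROPERTY name."""
    if not short_name.endswith("_ENCODING"):
        raise ValueError("short name must end with '_ENCODING'")

    def wrap(cls):
        REGISTRY[short_name] = cls
        return cls

    return wrap


@register("BITMAP_ENCODING")
class BitmapEncoder(Encoder):
    def encode(self, values):
        # One bitmap per distinct value; bit i is set when row i holds the value.
        bitmaps = {}
        for row, value in enumerate(values):
            bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
        return len(values), bitmaps

    def decode(self, encoded):
        row_count, bitmaps = encoded
        out = [None] * row_count
        for value, bits in bitmaps.items():
            for row in range(row_count):
                if (bits >> row) & 1:
                    out[row] = value
        return out
```

A table property such as `'BITMAP_ENCODING'='C7'` would then be resolved by looking up `REGISTRY["BITMAP_ENCODING"]` and applying the returned encoder to column C7.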
> >>
> >> Using this approach for extension, there are some potential new
> >> encodings that we can consider in future:
> >> 1) LOCAL_DICTIONARY_ENCODE, for string columns whose cardinality is
> >> not so high, so that we can build the dictionary within one file.
> >> 2) BITMAP_ENCODE, for low cardinality columns
> >> 3) DELTA_OF_DELTA_ENCODE, for timestamp columns, invented by
> >> Facebook in Gorilla (http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)
> >> 4) XOR_ENCODE, for floating point measures, invented by Facebook in
> >> Gorilla
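
The two Gorilla-style ideas in the list can be sketched in Python as follows. The sketch shows only the arithmetic core (delta-of-delta for timestamps, XOR of consecutive IEEE 754 bit patterns for floats); the variable-length bit packing that gives Gorilla its compression is deliberately omitted, and the function names are assumptions for this example.

```python
import struct


def delta_of_delta_encode(timestamps):
    """Store the first timestamp, then each delta minus the previous delta."""
    if not timestamps:
        return []
    out, prev, prev_delta = [timestamps[0]], timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)
        prev, prev_delta = t, delta
    return out


def delta_of_delta_decode(encoded):
    """Inverse of delta_of_delta_encode."""
    if not encoded:
        return []
    out, prev, prev_delta = [encoded[0]], encoded[0], 0
    for dod in encoded[1:]:
        prev_delta += dod
        prev += prev_delta
        out.append(prev)
    return out


def _float_to_bits(x):
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def _bits_to_float(bits):
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def xor_encode(values):
    """XOR each double's 64-bit pattern with the previous one (first with 0)."""
    out, prev = [], 0
    for value in values:
        bits = _float_to_bits(value)
        out.append(bits ^ prev)
        prev = bits
    return out


def xor_decode(xored):
    """Inverse of xor_encode."""
    out, prev = [], 0
    for x in xored:
        prev ^= x
        out.append(_bits_to_float(prev))
    return out
```

For regularly spaced timestamps such as `[1000, 1060, 1120, 1181]` the encoded form is `[1000, 60, 0, 1]`, and repeated float values XOR to zero; both produce many small or zero values that a bit-packing stage could then store compactly.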
> >>
> >> In the first development iteration, only native encodings will be
> >> supported, so these new encodings should be added into the CarbonData
> >> project. In the second iteration, we can consider opening the interface
> >> for 3rd party developers to add encodings outside of the CarbonData
> >> project, maybe by providing the encoding class name explicitly in
> >> another independent TBLPROPERTY option.
> >>
> >> 5. Improvement on storage and performance of high cardinality columns
> >> Ravindra has proposed some action items for non-dictionary encoding in
> >> the above-mentioned threads, to improve storage size and performance.
> >> They are still valid now and we should work on them alongside the work
> >> in this thread.
> >>
> >>
> >> ———— proposal ends
> >>
> >> Please comment on this proposal, focusing on:
> >> 1. Whether the overall design is clean or needs improvement
> >> 2. Correct me if I am wrong about the existing encoding methods. The
> >> encoding TBLPROPERTY option names are open for comment; please suggest
> >> a better one if you have it, especially for LV_BYTES_ENCODING (I am not
> >> feeling very confident about this one)
> >> 3. The idea of encoding cascading: should it work like this, or should
> >> we enumerate all encoding methods
> >> 4. You can suggest more potential encodings of your preference
> >>
> >>
> >> Regards,
> >> Jacky Li
> >>
> >>
> >>
> >
> >
> > --
> > Regards
> > Liang
>
>