http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Encoding-override-and-extensibility-tp12633p12716.html
use.
options and do an exact configuration.
group" when designing solution.
> For dictionary encoding related behavior, we had a discussion back in
> March:
>
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-td8010.html
>
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-amp-performance-td8146.html
>
> From these two mail threads, we conclude that:
> 1. The initial idea of non-dictionary encoding is only for high cardinality
> dimension columns; it should not be the default encoding for all dimension
> columns.
> 2. While there are some suggestions in the mail threads to improve the
> usability of the DDL, we still need to find a way to make it simpler for
> users to control the encoding. So I propose a new solution here in this
> thread.
>
>
> The main goal of this proposal is to introduce new TBLPROPERTY options that
> make it simpler to control the column encoding and also make it extensible
> by developers.
> The proposal is as follows:
>
> 1. Encoding override
> I propose to introduce a set of keywords in TBLPROPERTY to control the
> encoding of each field in the table. The goal is to make it simpler for
> users to control the encoding.
> One keyword represents one encoding type. Currently we have three encoding
> types for dimensions and two for measures:
>
> For dimension:
> 1) GLOBAL_DICTIONARY_ENCODE, for table level global dictionary
> encoding
> 2) LV_BYTES_ENCODE, for high cardinality string column and complex
> data type column, that are currently encoded as Length-Value encoded byte
> array
> 3) INVERTED_INDEX_ENCODE, for low cardinality column
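
To make the Length-Value idea concrete, here is a minimal Python sketch of an LV byte-array layout. It is not the CarbonData implementation: the 2-byte big-endian length prefix, the UTF-8 encoding, and the function names `lv_encode` / `lv_decode` are all assumptions made for illustration.

```python
import struct

def lv_encode(values):
    """Concatenate each string as a 2-byte length (assumed width) plus its UTF-8 bytes."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        if len(data) > 0xFFFF:
            raise ValueError("value too long for a 2-byte length prefix")
        out += struct.pack(">H", len(data))  # length
        out += data                          # value
    return bytes(out)

def lv_decode(buf):
    """Walk the buffer, reading a length and then that many bytes, until it is exhausted."""
    values, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        values.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return values
```

Because every value carries its own length, no dictionary lookup is needed to read it back, which is why this layout suits high cardinality columns.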
>
> For measure:
> 1) DELTA_ENCODE: use delta encoding
> 2) ADAPTIVE_ENCODE: encode value using adaptive data type.
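
As a rough illustration of the two measure encodings, the Python sketch below shows delta encoding and one possible reading of "adaptive data type": picking the narrowest signed integer width that holds every value in the column. The function names and the width rule are assumptions, not CarbonData code.

```python
def delta_encode(values):
    """Store the first value, then each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for i, d in enumerate(encoded):
        total = d if i == 0 else total + d
        out.append(total)
    return out

def adaptive_width(values):
    """Smallest signed width in bytes (1, 2, 4 or 8) that fits every value.

    Expects a non-empty list of integers.
    """
    lo, hi = min(values), max(values)
    for width in (1, 2, 4, 8):
        bound = 1 << (8 * width - 1)
        if -bound <= lo and hi < bound:
            return width
    raise OverflowError("value does not fit in 8 bytes")
```

The two combine naturally: deltas of a slowly changing INT column are small, so the adaptive step can often store them in one or two bytes.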
>
> Users can control the encoding, for example:
> CREATE TABLE table (C1 STRING, C2 STRING, C3 STRING, C4 STRING, C5
> INT, C6 INT, C7 STRING) -- suppose C1 is a high cardinality column
> STORED BY 'carbondata'
> TBLPROPERTIES ('SORT_COLUMNS' = 'C1, C3',
> 'GLOBAL_DICTIONARY_ENCODE'='C2, C3, C4', 'LV_BYTES_ENCODE'='C1',
> 'DELTA_ENCODE'='C5')
>
> In this example, the MDK columns are C1 and C3; C2/C3/C4 are encoded with
> the global dictionary; C1 is a high cardinality column that uses LV_BYTES
> (no-dictionary); C5 is encoded using delta; and the other columns (C6/C7)
> are encoded using the default strategy.
> The advantages of this approach are that users can:
> 1) express encoding independently of the MDK columns, a long-standing
> requirement from the community.
> 2) compare the efficiency of encodings by explicitly
> specifying different encodings for the same field in two tables. This is
> required when exploring new encoding methods.
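
For discussion, here is a small Python sketch of how the proposed keywords could be turned into a per-column list of requested encodings. The keyword names come from the proposal; the function `parse_encoding_overrides`, the case-insensitive column matching, and the error handling are assumptions for illustration only.

```python
ENCODING_KEYS = (
    "GLOBAL_DICTIONARY_ENCODE",
    "LV_BYTES_ENCODE",
    "INVERTED_INDEX_ENCODE",
    "DELTA_ENCODE",
    "ADAPTIVE_ENCODE",
)

def parse_encoding_overrides(tblproperties, columns):
    """Map each column to the encodings explicitly requested for it.

    Columns with no override get an empty list, meaning "use the default strategy".
    """
    known = {c.upper() for c in columns}
    overrides = {c.upper(): [] for c in columns}
    for key in ENCODING_KEYS:  # fixed order keeps the result deterministic
        raw = tblproperties.get(key, "")
        for name in (n.strip().upper() for n in raw.split(",")):
            if not name:
                continue
            if name not in known:
                raise ValueError(f"{key} names unknown column {name}")
            overrides[name].append(key)
    return overrides

# The table from the example above.
props = {
    "SORT_COLUMNS": "C1, C3",
    "GLOBAL_DICTIONARY_ENCODE": "C2, C3, C4",
    "LV_BYTES_ENCODE": "C1",
    "DELTA_ENCODE": "C5",
}
result = parse_encoding_overrides(props, ["C1", "C2", "C3", "C4", "C5", "C6", "C7"])
```

Note that SORT_COLUMNS is ignored here on purpose: the point of the proposal is that encoding is expressed independently of the MDK columns.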
>
> 2. Default strategy
> Using the above keywords, users can override the encoding method for
> specific columns. If the user does not specify those keywords, CarbonData
> will choose the encoding method based on a default strategy. The default
> strategy is the same as the current CarbonData 1.1 implementation, to ensure
> backward compatibility.
>
> In future, this default strategy could also be changed if a better strategy
> is found, for example, heuristic rules based on data distribution rather
> than just data type.
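
A minimal Python sketch of the override-then-default resolution follows. The default rule shown is a hypothetical placeholder, not the CarbonData 1.1 behavior (which this thread does not spell out); passing the strategy in as a function is one way to keep it replaceable later, for example by a data-distribution heuristic.

```python
def example_default_strategy(data_type):
    """Hypothetical placeholder rule keyed only on data type.

    This is NOT the CarbonData 1.1 default; it only stands in for it here.
    """
    if data_type == "STRING":
        return ["GLOBAL_DICTIONARY_ENCODE"]
    return ["ADAPTIVE_ENCODE"]

def choose_encoding(column, data_type, overrides, default_strategy):
    """Use the explicit override when present, otherwise fall back to the default."""
    explicit = overrides.get(column)
    if explicit:
        return explicit
    return default_strategy(data_type)

overrides = {"C5": ["DELTA_ENCODE"], "C6": []}
c5 = choose_encoding("C5", "INT", overrides, example_default_strategy)
c6 = choose_encoding("C6", "INT", overrides, example_default_strategy)
c7 = choose_encoding("C7", "STRING", overrides, example_default_strategy)
```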
>
> 3. Encoding cascading
> Encoding TBLPROPERTY options can be cascaded, for example:
> TBLPROPERTIES ('GLOBAL_DICTIONARY_ENCODE'='C2, C3, C4',
> 'INVERTED_INDEX_ENCODE'='C3') means that C3 is first encoded with the global
> dictionary and then encoded as an inverted index using the dictionary
> encoding output.
>
> Using this approach, users can control whether to build an inverted index
> for each column.
> This feature is currently mainly for the inverted index; we still need to
> explore whether it is suitable for all encoding methods.
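
The cascade for C3 can be pictured with the Python sketch below: the dictionary step turns values into surrogate keys, and the inverted index step consumes those keys. This uses a generic key-to-row-ids inverted index for illustration; CarbonData's actual on-disk inverted index layout may differ, and both function names are assumptions.

```python
def global_dictionary_encode(values):
    """Assign each distinct value a surrogate key (by sorted order) and encode the column."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    return dictionary, [dictionary[v] for v in values]

def inverted_index_encode(keys):
    """Map each surrogate key to the ascending list of row ids that hold it."""
    index = {}
    for row_id, key in enumerate(keys):
        index.setdefault(key, []).append(row_id)
    return index

# Cascade: the second stage runs on the output of the first.
column = ["b", "a", "b", "c"]
dictionary, keys = global_dictionary_encode(column)
index = inverted_index_encode(keys)
```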
>
> 4. Encoding extensibility
> Besides the currently supported encoding methods, we can make encoding
> extensible by developers. Developers can implement the encode/decode
> interface and give it a short name with an '_ENCODING' suffix. For example:
> TBLPROPERTIES ('BITMAP_ENCODING'='C7') to encode C7 as a bitmap in the above
> example.
>
> Using this approach for extension, there are some potential new encodings
> that we can consider in future:
> 1) LOCAL_DICTIONARY_ENCODE, for string columns whose cardinality is
> low enough that we can build the dictionary within one file.
> 2) BITMAP_ENCODE, for low cardinality column
> 3) DELTA_OF_DELTA_ENCODE, for timestamp column, invented by
> Facebook in Gorilla (http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)
> 4) XOR_ENCODE, for floating point measure, invented by Facebook in
> Gorilla
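
For readers unfamiliar with the two Gorilla-style encodings, here is a simplified Python sketch of the value transforms only. Gorilla additionally bit-packs the results with variable-length codes, which is omitted here; the function names are assumptions.

```python
import struct

def delta_of_delta_encode(timestamps):
    """First timestamp, first delta, then the change between consecutive deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [timestamps[0], deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]

def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode by accumulating deltas, then timestamps."""
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        out.append(out[-1] + delta)
    return out

def float_bits(x):
    """Raw IEEE 754 double bit pattern of x as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_encode(values):
    """First value's bits, then each value's bits XORed with its predecessor's."""
    bits = [float_bits(v) for v in values]
    return bits[:1] + [b ^ a for a, b in zip(bits, bits[1:])]
```

Regularly spaced timestamps produce runs of zeros after the second entry, and repeated or slowly changing floats produce XOR results that are zero or mostly zero bits, which is what makes the later bit-packing effective.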
>
> In the first development iteration, only native encodings will be supported,
> so these new encodings should be added into the CarbonData project. In the
> second iteration, we can consider opening the interface for third-party
> developers to add encodings outside of the CarbonData project, maybe by
> providing the encoding class name explicitly in another independent
> TBLPROPERTY option.
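
One way to picture the extension point is a registry keyed by the short name, sketched in Python below. The interface `ColumnEncoder`, the functions `register_encoder` / `lookup_encoder`, and the toy `BitmapEncoder` are all hypothetical; the real interface would live in CarbonData's Java/Scala code and is not defined in this thread.

```python
from abc import ABC, abstractmethod

class ColumnEncoder(ABC):
    """Hypothetical encode/decode interface a developer would implement."""

    @abstractmethod
    def encode(self, values): ...

    @abstractmethod
    def decode(self, encoded): ...

_REGISTRY = {}

def register_encoder(short_name, encoder_cls):
    """Bind a TBLPROPERTY short name to an encoder class; names are case-insensitive."""
    key = short_name.upper()
    if key in _REGISTRY:
        raise ValueError(f"encoder {key} is already registered")
    _REGISTRY[key] = encoder_cls

def lookup_encoder(short_name):
    """Instantiate the encoder registered under the given short name."""
    return _REGISTRY[short_name.upper()]()

class BitmapEncoder(ColumnEncoder):
    """Toy bitmap encoding: one integer bitmask per distinct value."""

    def encode(self, values):
        bitmaps = {}
        for row_id, v in enumerate(values):
            # Bit row_id is set when that row holds value v.
            bitmaps[v] = bitmaps.get(v, 0) | (1 << row_id)
        return len(values), bitmaps

    def decode(self, encoded):
        n, bitmaps = encoded
        out = [None] * n
        for v, mask in bitmaps.items():
            for row_id in range(n):
                if (mask >> row_id) & 1:
                    out[row_id] = v
        return out

register_encoder("BITMAP_ENCODING", BitmapEncoder)
```

With something like this, TBLPROPERTIES ('BITMAP_ENCODING'='C7') would resolve through `lookup_encoder("BITMAP_ENCODING")`; loading a class by explicit name, as suggested for the second iteration, would be an additional lookup path.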
>
> 5. Improvement on storage and performance of high cardinality column
> Ravindra has proposed some action items for non-dictionary encoding in the
> above-mentioned threads, to improve storage size and performance. They are
> still valid and we should work on them alongside the work in this thread.
>
>
> ———— proposal ends
>
> Please comment on this proposal focusing on:
> 1. Whether the overall design is clean or needs improvement
> 2. Correct me if I am wrong about the existing encoding methods. The
> encoding TBLPROPERTY option names are open for comment; please suggest
> better ones if you have them, especially for LV_BYTES_ENCODE (I am not
> feeling very confident about this one)
> 3. The idea of encoding cascading: should we make it work like this, or
> should we enumerate all encoding methods
> 4. You can suggest more potential encodings of your preference
>
>
> Regards,
> Jacky Li