Re: [DISCUSSION] Encoding override and extensibility

Posted by Jacky Li on May 17, 2017; 1:29am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Encoding-override-and-extensibility-tp12633p12750.html

Sure, I think we can refer to http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-amp-performance-td8146.html
It lists requirements that we can plan to work on.

Regards,
Jacky

> On May 14, 2017, at 11:34 AM, Jacky Li <[hidden email]> wrote:
>
> For dictionary encoding related behavior, we had a discussion back in March:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-td8010.html
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-amp-performance-td8146.html
>
> From these two mail threads, we concluded that:
> 1. The initial idea of non-dictionary encoding was only for high cardinality dimension columns; it should not be the default encoding for all dimension columns.
> 2. While the mail threads contain some suggestions for improving the usability of the DDL, we still need to find a way to make it simpler for users to control the encoding. So I propose a new solution here in this thread.
>
>
> The main goal of this proposal is to introduce new TBLPROPERTY keywords that make it simpler to control the column encoding and that developers can extend.
> The proposal follows.
>
> 1. Encoding override
> I propose to introduce a set of keywords in TBLPROPERTY to control the encoding of each field in the table. The goal is to make it simpler for users to control the encoding.
> One keyword represents one encoding type. Currently we have three encoding types for dimensions and two for measures:
>
> For dimension:
> 1) GLOBAL_DICTIONARY_ENCODE, for table level global dictionary encoding
> 2) LV_BYTES_ENCODE, for high cardinality string columns and complex data type columns, which are currently encoded as Length-Value encoded byte arrays
> 3) INVERTED_INDEX_ENCODE, for low cardinality columns
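
To make the Length-Value idea in 2) concrete, here is a minimal Python sketch. The 2-byte big-endian length prefix and the function names `lv_encode`/`lv_decode` are assumptions for illustration only; the actual CarbonData byte layout may differ.

```python
import struct

def lv_encode(values):
    """Concatenate strings as [2-byte big-endian length][UTF-8 bytes] records.

    The 2-byte length prefix is an assumption, not CarbonData's real layout.
    """
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        out += struct.pack(">H", len(data))  # length (L)
        out += data                          # value (V)
    return bytes(out)

def lv_decode(buf):
    """Walk the buffer record by record and recover the original strings."""
    values, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        values.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return values
```

No dictionary is built, so the cost per row is the value itself plus the fixed length prefix, which is why this form suits columns with many distinct values.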
>
> For measure:
> 1) DELTA_ENCODE: use delta encoding
> 2) ADAPTIVE_ENCODE: encode value using adaptive data type.
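
As a rough illustration of the two measure encodings, the sketch below stores each value as its offset from the column minimum (one common form of delta encoding) and then picks the narrowest signed integer width that fits. Both choices and all function names are assumptions; the thread does not say which delta variant or which widths CarbonData uses.

```python
def delta_encode(values):
    """Store each value as its offset from the column minimum (assumed variant).

    Expects a non-empty list of integers.
    """
    base = min(values)
    return base, [v - base for v in values]

def delta_decode(base, deltas):
    """Add the base back to every offset."""
    return [base + d for d in deltas]

def adaptive_width(values):
    """Return the smallest signed integer width in bytes that holds every value."""
    lo, hi = min(values), max(values)
    for width in (1, 2, 4, 8):
        bound = 1 << (8 * width - 1)
        if -bound <= lo and hi < bound:
            return width
    raise ValueError("value does not fit in 64 bits")
```

For a column such as [1000, 1003, 1001], the raw values need 2 bytes each, while the offsets [0, 3, 1] fit in 1 byte, which is the kind of saving the two encodings aim for when combined.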
>
> Users can control the encoding, for example:
> CREATE TABLE table (C1 STRING, C2 STRING, C3 STRING, C4 STRING, C5 INT, C6 INT, C7 STRING)     -- suppose C1 is a high cardinality column
> STORED BY 'carbondata'
> TBLPROPERTIES ('SORT_COLUMNS' = 'C1, C3', 'GLOBAL_DICTIONARY_ENCODE'='C2, C3, C4', 'LV_BYTES_ENCODE'='C1', 'DELTA_ENCODE'='C5')
>
> In this example, the MDK is C1 and C3; C2/C3/C4 are encoded as global dictionary; C1 is a high cardinality column that uses LV_BYTES (no-dictionary); C5 is encoded using delta; and the other columns (C6/C7) are encoded using the default strategy.
> The advantages of this approach are that users can:
> 1) express encoding independently of the MDK columns, a long-standing requirement from the community.
> 2) compare the efficiency of encodings by explicitly specifying different encodings for the same field in two tables. This is required when exploring new encoding methods.
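
One way the proposed keywords could be resolved into a per-column encoding list is sketched below in Python. The function name `parse_encoding_properties` and the rule "any key ending in `_ENCODE` is an encoding keyword" are assumptions made for illustration; none of this is existing CarbonData code.

```python
def parse_encoding_properties(columns, properties, suffix="_ENCODE"):
    """Map each column name to the encoding keywords that name it.

    Keys without the suffix (for example SORT_COLUMNS) are ignored here.
    """
    encodings = {c: [] for c in columns}
    for key, value in properties.items():
        if not key.endswith(suffix):
            continue
        for col in (c.strip() for c in value.split(",")):
            if col not in encodings:
                raise ValueError(f"{key} names unknown column {col!r}")
            encodings[col].append(key)
    return encodings

columns = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
properties = {
    "SORT_COLUMNS": "C1, C3",
    "GLOBAL_DICTIONARY_ENCODE": "C2, C3, C4",
    "LV_BYTES_ENCODE": "C1",
    "DELTA_ENCODE": "C5",
}
resolved = parse_encoding_properties(columns, properties)
```

With the example table, C6 and C7 end up with an empty list, which would be the signal to fall back to the default strategy.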
>
> 2. Default strategy
> Using the above keywords, users can override the encoding method for specific columns. If the user does not specify these keywords, CarbonData will choose the encoding method based on a default strategy. The default strategy is the same as the current CarbonData 1.1 implementation, to ensure backward compatibility.
>
> In the future, this default strategy could also be changed if a better strategy is found, for example, heuristic rules based on data distribution rather than just data type.
>
> 3. Encoding cascading
> Encoding TBLPROPERTY keywords can be cascaded, for example:
> TBLPROPERTIES ('GLOBAL_DICTIONARY_ENCODE'='C2, C3, C4', 'INVERTED_INDEX_ENCODE'='C3') means that C3 is first encoded as global dictionary and then encoded as an inverted index using the dictionary encoding output.
>
> Using this approach, users can control whether to build an inverted index for each column.
> This feature is currently mainly for inverted index; we still need to explore whether it is suitable for all encoding methods.
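
The cascade for C3 can be pictured with a small Python sketch: dictionary-encode first, then build the inverted index over the dictionary codes. This is a textbook dictionary and inverted index with hypothetical function names and sample data, not CarbonData's on-disk format.

```python
def dictionary_encode(values):
    """Replace each value with its position in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    position = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [position[v] for v in values]

def inverted_index(codes):
    """Map each dictionary code to the ascending row ids where it occurs."""
    postings = {}
    for row, code in enumerate(codes):
        postings.setdefault(code, []).append(row)
    return postings

# Stage 1: dictionary encoding; stage 2 consumes stage 1's output.
rows = ["cn", "us", "cn", "in", "us"]
dictionary, codes = dictionary_encode(rows)
index = inverted_index(codes)
```

The second stage never sees the original strings, only the integer codes, which is the sense in which the encodings cascade.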
>
> 4. Encoding extensibility
> Besides the currently supported encoding methods, we can make encoding extensible by developers. Developers can implement the encode/decode interface and give it a short name with the '_ENCODING' suffix. For example:
> TBLPROPERTIES ('BITMAP_ENCODING'='C7') encodes C7 as bitmap in the above example.
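
A minimal sketch of what such an extension point might look like, written in Python for brevity. The registry `ENCODERS`, the decorator `register_encoding`, and the class `BitmapEncoder` are all hypothetical; the proposal only says that developers implement an encode/decode interface and give it a short name.

```python
ENCODERS = {}  # hypothetical registry: TBLPROPERTY short name -> encoder class

def register_encoding(short_name):
    """Class decorator that registers an encoder under its short name."""
    if not short_name.endswith("_ENCODING"):
        raise ValueError("short name must end with _ENCODING")
    def decorator(cls):
        ENCODERS[short_name] = cls
        return cls
    return decorator

@register_encoding("BITMAP_ENCODING")
class BitmapEncoder:
    """Keeps one bitmap per distinct value; bit i is set when row i holds it."""

    def encode(self, values):
        bitmaps = {}
        for row, v in enumerate(values):
            bitmaps[v] = bitmaps.get(v, 0) | (1 << row)
        return len(values), bitmaps

    def decode(self, encoded):
        n, bitmaps = encoded
        out = [None] * n
        for v, bits in bitmaps.items():
            for row in range(n):
                if (bits >> row) & 1:
                    out[row] = v
        return out
```

A table property such as 'BITMAP_ENCODING'='C7' could then be resolved by looking up `ENCODERS["BITMAP_ENCODING"]` and applying it to column C7.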
>
> Using this approach for extension, there are some potential new encodings that we can consider in the future:
> 1) LOCAL_DICTIONARY_ENCODE, for string columns whose cardinality is low enough that we can build the dictionary within one file.
> 2) BITMAP_ENCODE, for low cardinality columns
> 3) DELTA_OF_DELTA_ENCODE, for timestamp columns, invented by Facebook in Gorilla (http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)
> 4) XOR_ENCODE, for floating point measures, invented by Facebook in Gorilla
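
For readers unfamiliar with the two Gorilla techniques in 3) and 4), the Python sketch below shows only the arithmetic: second-order differences for timestamps, and XOR against the previous value for doubles. It deliberately omits Gorilla's variable-length bit packing, and the function names are made up for illustration.

```python
import struct

def delta_of_delta(timestamps):
    """Return (first value, first delta, second-order differences).

    Expects a non-empty list; regular intervals yield mostly zeros.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], (deltas[0] if deltas else 0), dods

def xor_with_previous(values):
    """Return the first double's raw bits, then each value XORed with its predecessor."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [bits[0]] + [b ^ a for a, b in zip(bits, bits[1:])]
```

Timestamps arriving every 60 seconds with small jitter produce second-order differences near zero, and repeated or slowly changing doubles produce XOR results with long runs of zero bits; both are what the bit-packing stage in the paper then exploits.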
>
> In the first development iteration, only native encodings will be supported, so these new encodings should be added to the CarbonData project itself. In the second iteration, we can consider opening the interface for third-party developers to add encodings outside of the CarbonData project, perhaps by providing the encoding class name explicitly in another independent TBLPROPERTY option.
>
> 5. Improvement on storage and performance of high cardinality column
> Ravindra has proposed some action items for non-dictionary encoding in the above-mentioned threads, to improve storage size and performance. They are still valid, and we should work on them alongside the work in this thread.
>
>
> ———— proposal ends
>
> Please comment on this proposal focusing on:
> 1. Whether the overall design is clean or needs improvement
> 2. Correct me if I am wrong about the existing encoding methods. The encoding TBLPROPERTY option names are open for comment; please suggest better ones if you have them, especially for LV_BYTES_ENCODE (I am not very confident about this one)
> 3. The idea of encoding cascading: should it work like this, or should we enumerate all encoding methods?
> 4. You can suggest more potential encodings of your preference
>
>
> Regards,
> Jacky Li
>
>
>