Apache CarbonData Dev Mailing List archive

[DISCUSSION] Encoding override and extensibility

Jacky Li

May 14, 2017; 3:34am

[DISCUSSION] Encoding override and extensibility

For dictionary-encoding-related behavior, we had two discussions back in March:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-td8010.html
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-amp-performance-td8146.html

From these two mail threads, we conclude that:
1. The initial idea of non-dictionary encoding was only for high cardinality dimension columns; it should not be the default encoding for all dimension columns.
2. While there are some suggestions in the mail threads to improve the usability of the DDL, we still need to find a way to make it simpler for users to control the encoding. So I propose a new solution here in this thread.

The main goal of this proposal is to introduce new TBLPROPERTIES options that make it simpler to control the column encoding and that developers can extend.
The proposal follows.

1. Encoding override
I propose to introduce a set of keywords in TBLPROPERTIES to control the encoding of each field in the table. The goal is to make it simpler for users to control the encoding.
Each keyword represents one encoding type. Currently we have three encoding types for dimensions and two for measures:

For dimension:
1) GLOBAL_DICTIONARY_ENCODE, for table-level global dictionary encoding
2) LV_BYTES_ENCODE, for high cardinality string columns and complex data type columns, which are currently encoded as Length-Value (LV) byte arrays
3) INVERTED_INDEX_ENCODE, for low cardinality columns

For measure:
1) DELTA_ENCODE: use delta encoding
2) ADAPTIVE_ENCODE: encode value using adaptive data type.
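
To make the two measure encodings concrete, here is a minimal Python sketch of the general techniques. It is an illustration under my own assumptions, not the CarbonData implementation: the function names are hypothetical, delta encoding is shown in its textbook form (differences between consecutive values), and "adaptive" is read as picking the narrowest signed integer width that fits every value.

```python
def delta_encode(values):
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def adaptive_width(values):
    """Return the smallest signed integer width in bytes (1, 2, 4 or 8) that holds all values."""
    lo, hi = min(values), max(values)
    for width in (1, 2, 4, 8):
        bound = 1 << (8 * width - 1)
        if -bound <= lo and hi < bound:
            return width
    raise OverflowError("values do not fit in 8 bytes")
```

The two can be combined: a column such as 100, 102, 105, 105 needs two bytes per value as stored here only if a value exceeds 127, but after delta encoding all entries except the first are small and the narrower width applies to them.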

Users can control the encoding, for example:
CREATE TABLE table (C1 STRING, C2 STRING, C3 STRING, C4 STRING, C5 INT, C6 INT, C7 STRING) -- suppose C1 is a high cardinality column
STORED BY 'carbondata'
TBLPROPERTIES ('SORT_COLUMNS' = 'C1, C3', 'GLOBAL_DICTIONARY_ENCODE' = 'C2, C3, C4', 'LV_BYTES_ENCODE' = 'C1', 'DELTA_ENCODE' = 'C5')

In this example, the MDK (multi-dimensional key) columns are C1 and C3; C2/C3/C4 are encoded as global dictionary; C1 is a high cardinality column that uses LV_BYTES (no-dictionary); C5 is encoded using delta; and the other columns (C6/C7) are encoded using the default strategy.
The advantages of this approach are that users can:
1) express encoding independently of the MDK columns, a long-standing requirement from the community.
2) compare the efficiency of encodings by explicitly specifying different encodings for the same field in two tables. This is required when exploring new encoding methods.
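
As a rough sketch of how such properties could be resolved into per-column encodings, the Python below inverts the property map. The function name and the rule that any key ending in `_ENCODE` is an encoding keyword are my assumptions for illustration; the real parser may work differently.

```python
ENCODING_SUFFIX = "_ENCODE"  # assumed marker for encoding keywords


def column_encodings(tbl_properties):
    """Map each column to the encoding keywords that name it, in declaration order."""
    result = {}
    for key, value in tbl_properties.items():
        if not key.upper().endswith(ENCODING_SUFFIX):
            continue  # e.g. SORT_COLUMNS is not an encoding keyword
        for col in value.split(","):
            col = col.strip()
            if col:
                result.setdefault(col, []).append(key.upper())
    return result


props = {
    "SORT_COLUMNS": "C1, C3",
    "GLOBAL_DICTIONARY_ENCODE": "C2, C3, C4",
    "LV_BYTES_ENCODE": "C1",
    "DELTA_ENCODE": "C5",
}
encodings = column_encodings(props)
```

Columns that do not appear in the result (C6 and C7 here) are the ones left to the default strategy, and SORT_COLUMNS is resolved separately, which is what keeps encoding independent of the MDK columns.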

2. Default strategy
Using the above keywords, users can override the encoding method for specific columns. If the user does not specify those keywords, CarbonData will choose the encoding method based on a default strategy. The default strategy is the same as in the current CarbonData 1.1 implementation, to ensure backward compatibility.

In the future, this default strategy could be changed if a better one is found, for example, heuristic rules based on data distribution rather than just data type.
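
The override-then-default order can be sketched as follows. Everything specific here is a placeholder I made up for illustration: the per-type default table does not claim to reproduce the CarbonData 1.1 rules, and the distinct-ratio threshold is only one possible data-distribution heuristic.

```python
# Placeholder defaults per data type; NOT the actual CarbonData 1.1 rules.
PLACEHOLDER_DEFAULTS = {
    "STRING": "GLOBAL_DICTIONARY_ENCODE",
    "INT": "ADAPTIVE_ENCODE",
}


def choose_encoding(column, data_type, overrides,
                    distinct_ratio=None, high_cardinality_ratio=0.5):
    """Pick an encoding: an explicit override wins, then a heuristic, then the type default."""
    if column in overrides:
        return overrides[column]
    # Hypothetical heuristic: a string column whose share of distinct values is
    # above the threshold is treated as high cardinality.
    if (data_type == "STRING" and distinct_ratio is not None
            and distinct_ratio > high_cardinality_ratio):
        return "LV_BYTES_ENCODE"
    return PLACEHOLDER_DEFAULTS[data_type]
```

Keeping the heuristic behind the override check means that a later change to the default strategy cannot alter tables where the user stated the encoding explicitly.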

3. Encoding cascading
Encoding properties can be cascaded, for example:
TBLPROPERTIES ('GLOBAL_DICTIONARY_ENCODE' = 'C2, C3, C4', 'INVERTED_INDEX_ENCODE' = 'C3') means that C3 is first encoded as global dictionary and then encoded as inverted index using the dictionary-encoded output.

Using this approach, users can control whether to build an inverted index for each column.
This feature is currently mainly for inverted index; we still need to explore whether it is suitable for all encoding methods.
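
A small Python sketch of the cascade, assuming that "inverted index" means sorting the column's encoded keys and keeping the original row id of each sorted entry. The function names and the in-memory representation are hypothetical, and the dictionary here is built from the sample values only, not a real table-level global dictionary.

```python
def dictionary_encode(values):
    """Replace each value with a surrogate key assigned in sorted value order."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    return [dictionary[v] for v in values], dictionary


def inverted_index_encode(keys):
    """Sort the keys and record which original row each sorted entry came from."""
    row_ids = sorted(range(len(keys)), key=lambda r: keys[r])
    return [keys[r] for r in row_ids], row_ids


def inverted_index_decode(sorted_keys, row_ids):
    """Put each key back at its original row position."""
    out = [None] * len(sorted_keys)
    for key, row in zip(sorted_keys, row_ids):
        out[row] = key
    return out


# Cascade: the second stage consumes the first stage's output.
keys, dictionary = dictionary_encode(["b", "a", "c", "a"])
sorted_keys, row_ids = inverted_index_encode(keys)
```

The second stage only sees integers, so it does not need to know that a dictionary produced them; that separation is what would let the cascade be expressed as two independent keywords.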

4. Encoding extensibility
Besides the currently supported encoding methods, we can make encoding extensible by developers. Developers can implement the encode/decode interface and give it a short name with the '_ENCODE' suffix. For example:
TBLPROPERTIES ('BITMAP_ENCODE' = 'C7') encodes C7 as bitmap in the above example.
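
One possible shape for such an interface, sketched in Python. The class names, the registry, and the decorator are all hypothetical, and the suffix is kept as a constant because the keyword names are still open for comment. The bitmap encoding is a toy version that keeps one integer bit set per distinct value.

```python
from abc import ABC, abstractmethod

SUFFIX = "_ENCODE"   # assumed keyword suffix
REGISTRY = {}        # keyword -> encoding class


class Encoding(ABC):
    """Hypothetical encode/decode contract that an extension implements."""

    @abstractmethod
    def encode(self, values): ...

    @abstractmethod
    def decode(self, encoded): ...


def register(short_name):
    """Class decorator that exposes an encoding under SHORTNAME + SUFFIX."""
    def wrap(cls):
        REGISTRY[short_name + SUFFIX] = cls
        return cls
    return wrap


@register("BITMAP")
class BitmapEncoding(Encoding):
    def encode(self, values):
        bitmaps = {}
        for row, v in enumerate(values):
            # Set bit `row` in the bitmap that belongs to value v.
            bitmaps[v] = bitmaps.get(v, 0) | (1 << row)
        return len(values), bitmaps

    def decode(self, encoded):
        n, bitmaps = encoded
        out = [None] * n
        for v, bits in bitmaps.items():
            for row in range(n):
                if (bits >> row) & 1:
                    out[row] = v
        return out
```

With a registry like this, resolving a table property is a dictionary lookup, for example `REGISTRY["BITMAP_ENCODE"]()`, and an unknown keyword can be rejected when the table is created.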

Using this approach for extension, there are some potential new encodings that we can consider in the future:
1) LOCAL_DICTIONARY_ENCODE, for string columns whose cardinality is low enough that we can build a dictionary within one file.
2) BITMAP_ENCODE, for low cardinality columns
3) DELTA_OF_DELTA_ENCODE, for timestamp columns, used by Facebook in Gorilla (http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)
4) XOR_ENCODE, for floating point measures, also used by Facebook in Gorilla
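
For readers who have not seen the last two, here is a simplified Python sketch of the ideas. It shows only the value transformation; the variable-length bit packing that Gorilla applies on top is deliberately left out, and the function names are mine.

```python
import struct


def delta_of_delta_encode(timestamps):
    """First value, first delta, then the change in delta between consecutive points."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def delta_of_delta_decode(encoded):
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


def float_bits(x):
    """Reinterpret a double as its 64-bit integer bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_encode(values):
    """First value's bit pattern, then each value XORed with its predecessor."""
    bits = [float_bits(v) for v in values]
    return bits[:1] + [a ^ b for a, b in zip(bits, bits[1:])]


def xor_decode(encoded):
    out, prev = [], 0
    for x in encoded:
        prev ^= x
        out.append(struct.unpack(">d", struct.pack(">Q", prev))[0])
    return out
```

Both transformations turn regular data into runs of zeros or near-zeros: timestamps at a fixed interval give a delta-of-delta of 0, and repeated float values give an XOR of 0, which is what a later bit-packing stage would exploit.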

In the first development iteration, only native encodings will be supported, so these new encodings should be added into the CarbonData project itself. In the second iteration, we can consider opening the interface for third-party developers to add encodings outside of the CarbonData project, maybe by providing the encoding class name explicitly in a separate TBLPROPERTIES option.

5. Improvement on storage and performance of high cardinality column
Ravindra has proposed some action items for non-dictionary encoding in the above-mentioned threads, to improve storage size and performance. They are still valid, and we should work on them alongside the work in this thread.
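
For reference, the Length-Value layout used for non-dictionary columns can be sketched like this in Python. The 2-byte big-endian length prefix and UTF-8 payload are assumptions for the sketch, not a statement of the actual on-disk format.

```python
import struct


def lv_encode(strings):
    """Concatenate (length, value) pairs; the length is an assumed 2-byte prefix."""
    out = bytearray()
    for s in strings:
        data = s.encode("utf-8")
        out += struct.pack(">H", len(data)) + data  # fails if a value exceeds 65535 bytes
    return bytes(out)


def lv_decode(buf):
    """Walk the buffer, reading a length and then that many bytes, until it is exhausted."""
    out, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        out.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return out
```

Every value is stored in full, so the size grows with the row count even when values repeat, and finding the n-th value requires walking the preceding lengths.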

———— proposal ends

Please comment on this proposal, focusing on:
1. Whether the overall design is clean or needs improvement
2. Correct me if I am wrong about the existing encoding methods. The encoding option names are open for comment; please suggest better ones if you have them, especially for LV_BYTES_ENCODE (I am not very confident about this one)
3. The idea of encoding cascading: should it work like this, or should we enumerate all encoding methods?
4. You can suggest more potential encodings of your preference

Regards,
Jacky Li

Liang Chen

May 16, 2017; 9:54am

Re: [DISCUSSION] Encoding override and extensibility

Hi

This is a great discussion for making the "encoding functions" easier to
use.

Exposing all these options to users for different business cases is
good. But to be frank, it is difficult for general users to understand all
the options and configure them exactly.
So we need to think more about a "default option" or "default option
group" when designing the solution.

For example, for a high cardinality column that could be set with 'LV_BYTES_ENCODE'='C1',
what is the default encoding behavior if users don't set any option for
these columns?

Regards
Liang

Jacky Li

May 16, 2017; 1:15pm

Re: [DISCUSSION] Encoding override and extensibility

Hi,

I mentioned that there will be a default strategy if the user does not set any encoding options. For example, if the user does not set an encoding option for a high cardinality dimension column, Carbon will use the default encoding, which is LV_BYTES_ENCODE for this column.

Regards,
Jacky

Liang Chen

May 16, 2017; 2:03pm

Re: [DISCUSSION] Encoding override and extensibility

Hi

I mean that we need to consider *enhancing the current default* for these
encoding options, not only exposing all these options to users.

Regards
Liang

Jacky Li

May 17, 2017; 1:29am

Re: [DISCUSSION] Encoding override and extensibility

In reply to this post by Jacky Li

Sure, I think we can refer to http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-amp-performance-td8146.html
There is a list of requirements there that we can plan to work on.

Regards,
Jacky
