Apache CarbonData Dev Mailing List archive

[DISCUSSION] Optimizing the writing of min max for a column

manishgupta88

[DISCUSSION] Optimizing the writing of min max for a column

Hi Dev

I am currently working on a min max optimization wherein, for string/varchar
data type columns, we will decide internally whether to write min max or not.

*Background*
Currently we store min max for all the columns. We store page min max,
blocklet min max in the file footer, and all the blocklet metadata entries
in the shard. Consider the case where each column value is more than 10000
characters long. In this case, if we write min max, it will be written 3
times for each column. This increases the store size, which in turn impacts
query performance.
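
To put a rough number on the overhead described above, here is a small back-of-the-envelope sketch. It assumes one byte per character, a min and a max that are both close to the full value length, and a single page per blocklet; the function name is hypothetical and only serves the illustration.

```python
def minmax_overhead_bytes(value_bytes: int, copies: int = 3) -> int:
    """Bytes spent on min and max for one column when each is stored `copies` times.

    Assumption: min and max are each about `value_bytes` long.
    """
    # One min plus one max per stored copy.
    return 2 * value_bytes * copies


# A 10000-character column value, written at the three levels mentioned above.
print(minmax_overhead_bytes(10_000))  # 60000
```

Under these assumptions the statistics alone cost about 60 KB per column per blocklet, before any actual data is counted.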

*Design proposal*
1. We will introduce a configurable system level property for max
characters *"carbon.string.allowed.character.count".* If the data crosses
this limit then min max will not be stored for that column.
2. If a page does not contain min max for a column, then blocklet min max
will also not contain the entry for min max of that column.
3. The Thrift file will be modified to introduce an optional Boolean flag
which will be used in the query to identify whether min max is stored for
the filter column or not.
4. As of now it will be supported only for dimensions of string/varchar
type. We can extend it to support bigDecimal type measures in the future
if required.
5. The block and blocklet dataMap cache will also store the min max Boolean
flag for dimension columns, based on which filter pruning will be done. If
min max is not written for a column then isScanRequired will return true in
driver pruning.
6. In the executor, page and blocklet level min max will be checked again
for the filter column. If min max is not written then the complete page
data will be scanned.
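
The flow in points 1 to 6 can be sketched roughly as follows. This is a minimal illustration, not the CarbonData implementation: the names MinMax, page_min_max, blocklet_min_max and is_scan_required are hypothetical, the limit is passed in explicitly, lengths are measured on encoded bytes for simplicity, and the filter is assumed to be a simple equality filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinMax:
    min_value: bytes
    max_value: bytes
    is_written: bool  # the Boolean flag described in point 3


def page_min_max(values: List[bytes], limit: int) -> MinMax:
    """Point 1: skip min max when any value in the (non-empty) page crosses the limit."""
    if any(len(v) > limit for v in values):
        return MinMax(b"", b"", False)
    return MinMax(min(values), max(values), True)


def blocklet_min_max(pages: List[MinMax]) -> MinMax:
    """Point 2: one page without min max means the blocklet has none either."""
    if not all(p.is_written for p in pages):
        return MinMax(b"", b"", False)
    return MinMax(min(p.min_value for p in pages),
                  max(p.max_value for p in pages),
                  True)


def is_scan_required(stats: MinMax, filter_value: bytes) -> bool:
    """Points 5 and 6: without min max nothing can be pruned, so always scan."""
    if not stats.is_written:
        return True
    return stats.min_value <= filter_value <= stats.max_value
```

The same is_scan_required check would apply at both the driver (block/blocklet) and the executor (blocklet/page) levels; only the granularity of the statistics differs.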

*Backward compatibility*
1. For stores written prior to 1.5.0, the min max flag for all the columns
will be set to true while loading the dataMap in the query flow.
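
The backward compatibility rule could look something like the sketch below. It assumes that a file written by an older version simply carries no flag list (represented here as None), which is what an optional Thrift field would give; the function name load_min_max_flags is hypothetical.

```python
from typing import List, Optional


def load_min_max_flags(stored_flags: Optional[List[bool]],
                       column_count: int) -> List[bool]:
    """Return the per-column "min max written" flags for one file.

    `stored_flags` is None for files written before the flag existed.
    Those files always contain min max, so every column defaults to True.
    """
    if stored_flags is None:
        return [True] * column_count
    return list(stored_flags)


print(load_min_max_flags(None, 3))           # [True, True, True]
print(load_min_max_flags([True, False], 2))  # [True, False]
```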

Please feel free to share your inputs and suggestions.

Regards
Manish Gupta

ravipesala

Re: [DISCUSSION] Optimizing the writing of min max for a column

+1

It is an essential feature for big strings. We should not store
min/max for large text columns, as it increases storage.

Regards,
Ravindra

On Sat, 15 Sep 2018 at 12:14 PM, manish gupta <[hidden email]>
wrote:


xuchuanyin

Re: [DISCUSSION] Optimizing the writing of min max for a column

In reply to this post by manishgupta88

What is the default value of the property ‘carbon.string.allowed.character.count’?

Many IDs are actually strings, so I think we should set it to a reasonable value that does not affect the behavior of common usage.

manishgupta88

Re: [DISCUSSION] Optimizing the writing of min max for a column

Hi Xuchuanyin

Please find below the details for the property
‘carbon.string.allowed.character.count’.

Property name: carbon.string.allowed.character.count
Default value: 500
Max value: 1000
Min value: 10

Regards
Manish Gupta

On Sun, Sep 16, 2018 at 9:32 AM xuchuanyin <[hidden email]> wrote:


manishgupta88

Re: [DISCUSSION] Optimizing the writing of min max for a column

Hi Dev

After discussion with PMC members, the property name has been changed to
'carbon.minmax.allowed.byte.count'. Below is the updated list of
configurations.

Default value: 200 bytes (100 characters)
Max value: 1000 bytes (500 characters)
Min value: 10 bytes (5 characters)
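
For illustration, resolving the configured value against these bounds might look like the sketch below. The constants come from the list above; the function name and the choice to fall back to the default for a missing, non-numeric or out-of-range value are assumptions, not confirmed behavior.

```python
from typing import Optional

DEFAULT_BYTE_COUNT = 200
MIN_BYTE_COUNT = 10
MAX_BYTE_COUNT = 1000


def resolve_allowed_byte_count(configured: Optional[str]) -> int:
    """Parse 'carbon.minmax.allowed.byte.count', falling back to the default.

    Assumption: invalid or out-of-range values are replaced by the default
    rather than clamped to the nearest bound.
    """
    if configured is None:
        return DEFAULT_BYTE_COUNT
    try:
        value = int(configured)
    except ValueError:
        return DEFAULT_BYTE_COUNT
    if MIN_BYTE_COUNT <= value <= MAX_BYTE_COUNT:
        return value
    return DEFAULT_BYTE_COUNT


print(resolve_allowed_byte_count("500"))  # 500
print(resolve_allowed_byte_count("5"))    # 200
```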

Regards
Manish Gupta

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

xuchuanyin

Re: [DISCUSSION] Optimizing the writing of min max for a column

Hi Manish, after reviewing your PR, I came up with another idea for
implementing this feature:

I'd like to use a unique `FAKE` value to store the minmax for those
columns.
When comparing the filter value against the fake minmax, we know that it is
a fake minmax, so the filter procedure just returns true.
In this way, we do not need to modify the metadata and can avoid many code
changes.

BTW, it seems that we already use a unique value for 'NULL' during
conversion, so I think a unique value for fake minmax is acceptable.

What do you think about it?
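
A rough sketch of the sentinel idea, to make the comparison concrete. The sentinel bytes, the limit and the function names are all hypothetical; the proposal does not specify what the FAKE value would be.

```python
from typing import Tuple

# Hypothetical sentinel; any reserved byte sequence would play the same role.
FAKE_MIN_MAX = b"\x00__NO_MIN_MAX__\x00"


def encode_min_max(lo: bytes, hi: bytes, limit: int = 200) -> Tuple[bytes, bytes]:
    """Store the sentinel instead of the real min max when either is too long."""
    if len(lo) > limit or len(hi) > limit:
        return FAKE_MIN_MAX, FAKE_MIN_MAX
    return lo, hi


def is_scan_required(stored_min: bytes, stored_max: bytes,
                     filter_value: bytes) -> bool:
    """A fake min max cannot prune anything, so the filter just returns True."""
    if stored_min == FAKE_MIN_MAX and stored_max == FAKE_MIN_MAX:
        return True
    return stored_min <= filter_value <= stored_max
```

No metadata change is needed in this variant, since the existing min and max fields carry the signal.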


manishgupta88

Re: [DISCUSSION] Optimizing the writing of min max for a column

Hi Xuchuanyin

The idea you have mentioned is good and correct. But I feel that the current
implementation is better for the following reasons.

1. The code is easier to understand with the current implementation.
Looking at the thrift, anyone can understand the design and see that it has
a boolean flag that says whether min max is stored for a particular column.
This will also help the Carbon CLI tool, where we can display whether min
max for a column is stored or not without comparing the min and max values
of all columns with FAKE data.
2. It is difficult to decide on the FAKE value. Any value which we choose as
the FAKE value will become a data limitation. In the near future we will be
extending this feature to support the binary type also.
3. Comparing a boolean flag will be much faster than a FAKE value byte
comparison.
4. The memory space required for storing the boolean flag will be
negligible, as we are already saving space when the specified byte limit is
reached by storing a 0 length byte array as min and max.
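
Points 1 and 2 can be illustrated with a small sketch. The sentinel value and both function names are hypothetical; the only thing the example is meant to show is that a sentinel can collide with real data, while an explicit flag cannot.

```python
# Hypothetical sentinel, chosen only for this illustration.
FAKE_MIN_MAX = b"__FAKE__"


def has_min_max_by_sentinel(min_value: bytes, max_value: bytes) -> bool:
    """Sentinel approach: infer the answer by comparing stored bytes."""
    return not (min_value == FAKE_MIN_MAX and max_value == FAKE_MIN_MAX)


def has_min_max_by_flag(is_written: bool) -> bool:
    """Flag approach: the metadata states the answer directly."""
    return is_written


# A page whose only real value happens to equal the sentinel.
real_min = real_max = b"__FAKE__"
print(has_min_max_by_sentinel(real_min, real_max))  # False, although min max is real
print(has_min_max_by_flag(True))                    # True
```

With the sentinel, such a page loses pruning and would be reported by a tool as having no min max; with the flag, the stored min and max are never interpreted as anything other than data.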

Regards
Manish Gupta
