Re: [DISCUSSION] Optimizing the writing of min max for a column
Posted by
ravipesala on
Sep 16, 2018; 2:37am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Optimizing-the-writing-of-min-max-for-a-column-tp62515p62545.html
+1
It is an essential feature for big strings. We should not store
min/max for large text columns as it increases storage.
Regards,
Ravindra
On Sat, 15 Sep 2018 at 12:14 PM, manish gupta <[hidden email]> wrote:
> Hi Dev
>
> I am currently working on a min max optimization wherein, for string/varchar
> data type columns, we will decide internally whether to write min max or
> not.
>
> *Background*
> Currently we store min max for all the columns: page min max, blocklet
> min max in the file footer, and all the blocklet metadata entries in the
> shard. Consider the case where each column's data size is more than 10000
> characters. In this case, if we write min max, it will be written 3 times
> for each column. This increases the store size, which in turn impacts
> query performance.
>
> *Design proposal*
> 1. We will introduce a configurable system level property for max
> characters, *"carbon.string.allowed.character.count"*. If the data crosses
> this limit then min max will not be stored for that column.
> 2. If a page does not contain min max for a column, then blocklet min max
> will also not contain the entry for min max of that column.
> 3. The Thrift file will be modified to introduce an optional Boolean flag
> which will be used in query to identify whether min max is stored for the
> filter column or not.
> 4. As of now it will be supported only for dimensions of string/varchar
> type. We can extend it to support bigDecimal type measures in the
> future if required.
> 5. The block and blocklet dataMap cache will also store the min max
> Boolean flag for dimension columns, based on which filter pruning will be
> done. If min max is not written for a column then isScanRequired will
> return true in driver pruning.
> 6. In the executor, page and blocklet level min max will again be checked
> for the filter column. If min max is not written then the complete page
> data will be scanned.
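
A minimal sketch of points 1, 2, 5 and 6 above, to make the flow concrete. It is not CarbonData code: the class MinMaxSketch, the type ColumnStats and the methods pageStats and blockletStats are hypothetical names, only isScanRequired is a name taken from the proposal. It also assumes that the limit is checked against the computed min and max values, that values are compared as Java strings rather than as bytes, and that the filter is a simple equality filter.

```java
import java.util.List;

// Hypothetical sketch; none of these names are actual CarbonData APIs.
final class MinMaxSketch {

    /** Min/max statistics of one string column at page or blocklet level. */
    static final class ColumnStats {
        final boolean minMaxWritten; // the proposed Boolean flag
        final String min;            // null when minMaxWritten is false
        final String max;            // null when minMaxWritten is false

        ColumnStats(boolean minMaxWritten, String min, String max) {
            this.minMaxWritten = minMaxWritten;
            this.min = min;
            this.max = max;
        }

        static ColumnStats notWritten() {
            return new ColumnStats(false, null, null);
        }
    }

    /** Point 1: skip min/max when min or max is longer than the allowed character count. */
    static ColumnStats pageStats(List<String> values, int allowedCharCount) {
        if (values.isEmpty()) {
            return ColumnStats.notWritten();
        }
        String min = values.get(0);
        String max = values.get(0);
        for (String v : values) {
            if (v.compareTo(min) < 0) min = v;
            if (v.compareTo(max) > 0) max = v;
        }
        if (min.length() > allowedCharCount || max.length() > allowedCharCount) {
            return ColumnStats.notWritten();
        }
        return new ColumnStats(true, min, max);
    }

    /** Point 2: a blocklet carries no min/max if any of its pages lacks it. */
    static ColumnStats blockletStats(List<ColumnStats> pages) {
        if (pages.isEmpty()) {
            return ColumnStats.notWritten();
        }
        String min = null;
        String max = null;
        for (ColumnStats page : pages) {
            if (!page.minMaxWritten) {
                return ColumnStats.notWritten();
            }
            if (min == null || page.min.compareTo(min) < 0) min = page.min;
            if (max == null || page.max.compareTo(max) > 0) max = page.max;
        }
        return new ColumnStats(true, min, max);
    }

    /** Points 5 and 6: without min/max the unit cannot be pruned and must be scanned. */
    static boolean isScanRequired(ColumnStats stats, String filterValue) {
        if (!stats.minMaxWritten) {
            return true;
        }
        return filterValue.compareTo(stats.min) >= 0
            && filterValue.compareTo(stats.max) <= 0;
    }
}
```

The same isScanRequired check would apply in the driver (block and blocklet level) and in the executor (blocklet and page level); only the granularity of the ColumnStats passed in changes.
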
>
> *Backward compatibility*
> 1. For stores prior to 1.5.0, the min max flag for all the columns will be
> set to true while loading the dataMap in the query flow.
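
A rough sketch of this defaulting rule, under stated assumptions: LegacyMinMaxFlags, isBefore150 and resolveFlags are hypothetical names, the store version is assumed to be available as a dotted string such as "1.4.1", and a missing optional flag list is represented as null. The real dataMap loading code may look quite different.

```java
import java.util.Arrays;

// Hypothetical sketch; not the actual CarbonData dataMap loading code.
final class LegacyMinMaxFlags {

    /** True for versions such as "1.4.1"; false for "1.5.0" and later. */
    static boolean isBefore150(String version) {
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major < 1 || (major == 1 && minor < 5);
    }

    /**
     * Old stores always wrote min/max, so every column flag defaults to true.
     * The same default is used when the optional flag list is absent (null).
     */
    static boolean[] resolveFlags(String storeVersion, boolean[] flagsFromFooter, int columnCount) {
        if (isBefore150(storeVersion) || flagsFromFooter == null) {
            boolean[] allTrue = new boolean[columnCount];
            Arrays.fill(allTrue, true);
            return allTrue;
        }
        return flagsFromFooter.clone();
    }
}
```

With this default, pruning on old stores behaves exactly as before, since every column is treated as having min max available.
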
>
> Please feel free to share your inputs and suggestions.
>
> Regards
> Manish Gupta
>