
Re: [Discussion] Blocklet DataMap caching in driver

Posted by Jacky Li on Jun 22, 2018; 3:04pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Blocklet-DataMap-caching-in-driver-tp52582p52835.html

Hi Manish,

+1 for Solution 1 for the next Carbon version. Solution 2 should also be considered, but for a release after the next one.
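
To make sure we are reading Solution 1 the same way, here is a minimal sketch of what a column-scoped min/max cache in the driver could look like. This is only an illustration of the idea, not CarbonData's actual code: the class and method names, and the way the cached columns are chosen, are my assumptions.

    // Sketch only: keep blocklet min/max in the driver for user-chosen columns.
    // ColumnScopedMinMaxCache, BlockletMinMax and all names here are illustrative.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class ColumnScopedMinMaxCache {

      // Per-blocklet min/max, kept only for the columns the user asked to cache.
      static class BlockletMinMax {
        final Map<String, long[]> minMaxByColumn = new HashMap<>();
      }

      private final Set<String> cachedColumns;   // e.g. the user's filter columns
      private final Map<String, BlockletMinMax> cache = new HashMap<>();

      ColumnScopedMinMaxCache(Set<String> cachedColumns) {
        this.cachedColumns = cachedColumns;
      }

      // Store min/max for a blocklet, dropping columns the user did not configure;
      // this is what keeps the driver-side footprint bounded as loads grow.
      void put(String blockletId, Map<String, long[]> allColumnMinMax) {
        BlockletMinMax entry = new BlockletMinMax();
        allColumnMinMax.forEach((col, minMax) -> {
          if (cachedColumns.contains(col)) {
            entry.minMaxByColumn.put(col, minMax);
          }
        });
        cache.put(blockletId, entry);
      }

      // Pruning: a column without cached min/max must keep the blocklet (no false
      // drops); a cached column keeps it only when [min, max] overlaps the filter.
      boolean mightContain(String blockletId, String column, long lo, long hi) {
        BlockletMinMax entry = cache.get(blockletId);
        long[] minMax = (entry == null) ? null : entry.minMaxByColumn.get(column);
        if (minMax == null) {
          return true;
        }
        return minMax[1] >= lo && minMax[0] <= hi;
      }

      public static void main(String[] args) {
        // Suppose the user configured only "event_time" to be cached.
        ColumnScopedMinMaxCache cache = new ColumnScopedMinMaxCache(Set.of("event_time"));
        cache.put("blocklet_0", Map.of(
            "event_time", new long[] {1_000L, 1_999L},
            "amount", new long[] {5L, 500L}));          // dropped, not cached
        System.out.println(cache.mightContain("blocklet_0", "event_time", 2_500L, 3_000L)); // false -> pruned
        System.out.println(cache.mightContain("blocklet_0", "amount", 0L, 1L));             // true  -> kept
      }
    }

The user-facing part could simply be the list of filter columns supplied through a table property or configuration; the exact syntax would be part of the design.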

In my previous observation, in many scenarios users filter on a time range, and since each Carbon segment corresponds to one incremental load, a segment is normally correlated with time. So if we can also keep min/max of the sort_columns at the segment level, I think it will further help minimize the driver index. Will you also consider this?
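
For illustration, a rough sketch of that segment-level pruning on a time-like sort column; the names and numbers below are made up, not CarbonData's API:

    // Sketch only: per-segment min/max of a sort column (e.g. event_time) lets
    // the driver skip whole segments before touching any blocklet DataMap.
    import java.util.ArrayList;
    import java.util.List;

    public class SegmentMinMaxPruneSketch {

      // Hypothetical per-segment summary for one sort column.
      static class SegmentMinMax {
        final String segmentId;
        final long min;
        final long max;
        SegmentMinMax(String segmentId, long min, long max) {
          this.segmentId = segmentId;
          this.min = min;
          this.max = max;
        }
      }

      // Keep only segments whose [min, max] overlaps the filter range; only these
      // segments' blocklet DataMaps would then need to be loaded in the driver.
      static List<String> pruneSegments(List<SegmentMinMax> segments,
                                        long filterStart, long filterEnd) {
        List<String> hit = new ArrayList<>();
        for (SegmentMinMax s : segments) {
          if (s.max >= filterStart && s.min <= filterEnd) {
            hit.add(s.segmentId);
          }
        }
        return hit;
      }

      public static void main(String[] args) {
        // Each segment comes from one incremental load, so its time range is narrow.
        List<SegmentMinMax> segments = List.of(
            new SegmentMinMax("0", 1_000L, 1_999L),   // oldest load
            new SegmentMinMax("1", 2_000L, 2_999L),
            new SegmentMinMax("2", 3_000L, 3_999L));  // latest load
        // A filter on the latest time range prunes segments 0 and 1 entirely.
        System.out.println(pruneSegments(segments, 3_500L, 4_000L)); // [2]
      }
    }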

Regards,
Jacky


> On Jun 21, 2018, at 5:24 PM, manish gupta <[hidden email]> wrote:
>
> Hi Dev,
>
> The current implementation of Blocklet DataMap caching in the driver caches
> the min and max values of all the columns in the schema by default.
>
> The problem with this implementation is that as the number of loads
> increases, the memory required to hold the min and max values also increases
> considerably. We know that in most scenarios there is a single driver, and
> the memory configured for the driver is smaller than that configured for the
> executors. With a continuous increase in memory requirement, the driver can
> even go out of memory, which makes the situation even worse.
>
> *Proposed Solution to solve the above problem:*
>
> CarbonData uses min and max values for blocklet-level pruning. The user may
> not have filters on all the columns specified in the schema; often only a
> few columns have filters applied on them in a query.
>
> 1. We provide the user an option to cache the min and max values of only the
> required columns. Caching only the required columns can optimize the
> blocklet DataMap memory usage as well as solve the driver memory problem to
> a great extent.
>
> 2. Use an external storage/DB to cache the min and max values. We can
> implement a solution that creates a table in the external DB and stores the
> min and max values for all the columns in that table. This will not use any
> driver memory, and hence driver memory usage will be optimized even further
> as compared to Solution 1.
>
> *Solution 1* will not have any performance impact, as the user will cache
> the required filter columns and query execution will have no external
> dependency.
> *Solution 2* will degrade query performance, as it will involve querying the
> external DB for the min and max values required for blocklet pruning.
>
> *So from my point of view, we should go with Solution 1 and, in the near
> future, propose a design for Solution 2. The user can then have an option to
> select between the two*. Kindly share your suggestions.
>
> Regards
> Manish Gupta
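
For completeness, here is a rough sense of what Solution 2's lookup path could look like, which also shows where the query-time cost comes from. Everything below is an assumption for illustration: the carbon_minmax table, its schema, and the JDBC URL are made up; the point is only that every pruning pass would pay an external round trip.

    // Sketch only: Solution 2 would fetch min/max from an external DB during
    // pruning instead of from driver memory.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class ExternalMinMaxLookupSketch {

      // Returns blocklet ids whose [min, max] for the given column overlaps the
      // filter range, by querying a hypothetical carbon_minmax table.
      static List<String> pruneViaExternalDb(String column, long lo, long hi)
          throws SQLException {
        List<String> hit = new ArrayList<>();
        String sql = "SELECT blocklet_id FROM carbon_minmax "
            + "WHERE column_name = ? AND max_value >= ? AND min_value <= ?";
        try (Connection conn =
                 DriverManager.getConnection("jdbc:postgresql://db-host/carbon_meta");
             PreparedStatement ps = conn.prepareStatement(sql)) {
          ps.setString(1, column);
          ps.setLong(2, lo);
          ps.setLong(3, hi);
          try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
              hit.add(rs.getString("blocklet_id"));
            }
          }
        }
        return hit;
      }
    }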