[Discussion] Blocklet DataMap caching in driver
Posted by manishgupta88 on Jun 21, 2018; 9:24am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Blocklet-DataMap-caching-in-driver-tp52582.html
Hi Dev,
The current implementation of Blocklet dataMap caching in driver is that it
caches the min and max values of all the columns in schema by default.
The problem with this implementation is that as the number of loads
increases the memory required to hold min and max values also increases
considerably. We know that in most of the scenarios there is a single
driver and memory configured for driver is less as compared to executor.
With continuos increase in memory requirement driver can even go out of
memory which makes the situation further worse.
*Proposed Solution to solve the above problem:*
Carbondata uses min and max values for blocklet level pruning. It might not
be necessary that user has filter on all the columns specified in the
schema instead it could be only few columns that has filter applied on them
in the query.
1. We provide user an option to cache the min and max values of only the
required columns. Caching only the required columns can optimize the
blocklet dataMap memory usage as well as solve the driver memory problem to
a greater extent.
2. Using an external storage/DB to cache min and max values. We can also
implement a solution to create a table in the external DB and store min and
max values for all the columns in that table. This will not use any driver
memory and hence the driver memory usage will be optimized further as
compared to solution 1.
*Solution 1* will not have any performance impact as the user will cache
the required filter columns and it will not have any external dependency
for query execution.
*Solution 2* will degrade the query performance as it will involve querying
for min and max values from external DB required for Blocklet pruning.
*So from my point of view we should go with solution 1 and in near future
propose a design for solution 2. User can have an option to select between
the 2 options*. Kindly share your suggestions.
Regards
Manish Gupta