Hi Dev,
The current implementation of Blocklet dataMap caching in the driver is that it caches the min and max values of all the columns in the schema by default.

The problem with this implementation is that as the number of loads increases, the memory required to hold the min and max values also increases considerably. In most scenarios there is a single driver, and the memory configured for the driver is small compared to the executors. With a continuous increase in memory requirement, the driver can even go out of memory, which makes the situation even worse.

*Proposed solution to the above problem:*

Carbondata uses min and max values for blocklet level pruning. The user may not have filters on every column in the schema; often only a few columns have filters applied on them in a query.

1. Provide the user an option to cache the min and max values of only the required columns. Caching only the required columns can optimize the blocklet dataMap memory usage and solve the driver memory problem to a great extent.

2. Use an external storage/DB to cache the min and max values. We can implement a solution that creates a table in an external DB and stores the min and max values of all the columns in that table. This will not use any driver memory, so driver memory usage will be reduced even further compared to solution 1.

*Solution 1* will not have any performance impact, as the user will cache only the required filter columns and there is no external dependency for query execution.
*Solution 2* will degrade query performance, as it involves querying the external DB for the min and max values required for Blocklet pruning.

*So from my point of view we should go with solution 1 and, in the near future, propose a design for solution 2. The user can then have an option to select between the two.* Kindly share your suggestions.

Regards
Manish Gupta
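To make solution 1 concrete, here is a minimal sketch of how the user-facing option could look from Spark, assuming a session with CarbonData configured. The property name 'column_meta_cache' and the exact DDL shape are illustrative assumptions, not a finalized syntax.

    // Hypothetical sketch only: 'column_meta_cache' is an assumed property
    // name for illustration; the final name/syntax is still to be designed.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CarbonColumnMetaCacheSketch")
      .getOrCreate()

    // Cache driver-side min/max only for the columns the user expects to
    // use in filter conditions (here: country and order_id).
    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales (
        |  order_id BIGINT,
        |  country STRING,
        |  amount DOUBLE)
        |STORED BY 'carbondata'
        |TBLPROPERTIES ('column_meta_cache' = 'country,order_id')""".stripMargin)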
Hi Manish,
Thanks for proposing solutions to the driver memory problem. +1 for solution 1, but it may not be the complete solution. We should also have solution 2 to solve the driver memory issue completely; I think we should have solution 2 in the very near future as well.

I have a few doubts and suggestions related to solution 1:

1. What if a query comes on non-cached columns? Will the driver start reading the min/max from disk?
2. Are we planning to cache blocklet level information or block level information on the driver side for the cached columns?
3. What is the impact if we automatically choose the cached columns from the user query instead of letting the user configure them?

Regards,
Ravindra.
Thanks Ravi for the feedback. I completely agree with you that we need to
develop the second solution ASAP. Please find my responses to your queries below.

1. What if a query comes on non-cached columns? Will the driver start reading the min/max from disk?
- If the query is on a non-cached column, then all the blocks will be selected and min/max pruning will be done in each executor. There will not be any read on the driver side: the driver is a single process, and reading the min/max values from disk for every query would increase the pruning time. So I feel it is better to read in a distributed way using the executors.

2. Are we planning to cache blocklet level information or block level information on the driver side for the cached columns?
- We will provide the user an option to cache at Block or Blocklet level. It will be configurable at the table level, and the default will be caching at Block level. I will cover this part in detail in the design document.

3. What is the impact if we automatically choose the cached columns from the user query instead of letting the user configure them?
- Every query can have different filter columns, so if we choose them automatically, then for every new filter column the driver would have to read from disk and load into the cache. This can be cumbersome, and query time can vary unexpectedly, which may not be justifiable. So I feel it is better to let the user decide which columns to cache.

Let me know if you need any more clarification.

Regards
Manish Gupta
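As a rough illustration of the table-level granularity mentioned in point 2, a hypothetical property could be set as below; the name 'cache_level' and the values BLOCK/BLOCKLET are assumptions pending the design document, and a Spark session with CarbonData configured is assumed.

    // Hypothetical sketch: switch the driver cache granularity per table.
    // 'cache_level' and its BLOCK/BLOCKLET values are illustrative
    // assumptions; Block level is described above as the intended default.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CarbonCacheLevelSketch")
      .getOrCreate()

    spark.sql("ALTER TABLE sales SET TBLPROPERTIES ('cache_level' = 'BLOCKLET')")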
Hi Manish,
Thanks for proposing configurable columns for the min/max cache. This will help customers who have large data but use only a few columns in filter conditions. +1 for solution 1.

Regards,
Kanaka
Hi Manish,
+1 for solution 1 for the next carbon version. Solution 2 should also be considered, but for a future version after the next one.

In my observation, in many scenarios users filter on a time range, and since Carbon's segments correspond to incremental loads, a segment is normally correlated with time. So if we can keep min/max for the sort_columns at segment level, I think it will further help to minimize the driver index. Will you also consider this?

Regards,
Jacky
Thanks for the feedback Jacky.
As of now we have min/max at each block and blocklet level, and while loading the metadata cache we compute the task level min/max. Segment level min/max is not considered as of now, but this solution can surely be enhanced to consider segment level min/max as well.

We can discuss this further in detail and decide whether to include it now or to enhance it in the near future.

Regards
Manish Gupta
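For readers following the thread, the roll-up described above (per-blocklet min/max combined into block or task level, and potentially segment level) is a simple fold over the child ranges. The sketch below uses simplified Long values; the actual CarbonData metadata stores per-column min/max as byte arrays, so this is only an illustration of the idea.

    // Illustrative roll-up of per-blocklet min/max to a coarser level
    // (block, task, or segment). Values are simplified to Long for clarity.
    final case class MinMax(min: Long, max: Long) {
      def merge(other: MinMax): MinMax =
        MinMax(math.min(min, other.min), math.max(max, other.max))
    }

    def rollUp(children: Seq[MinMax]): MinMax = children.reduce(_ merge _)

    // Three blocklet ranges rolled up to one block-level range: MinMax(1, 42)
    val blockLevel = rollUp(Seq(MinMax(5, 20), MinMax(1, 9), MinMax(15, 42)))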
Hi Dev,
I have worked on the design document. Please find the link to the design document below and share your feedback.

https://drive.google.com/open?id=1lN06Pj5tBiBIPSxOBIjK9bpbFVhlUoQA

I have also raised a JIRA issue and uploaded the design document there. Please find the JIRA link below.

https://issues.apache.org/jira/browse/CARBONDATA-2638

Regards
Manish Gupta