[Discussion] Blocklet DataMap caching in driver


[Discussion] Blocklet DataMap caching in driver

manishgupta88
Hi Dev,

The current implementation of Blocklet DataMap caching in the driver caches
the min and max values of all the columns in the schema by default.

The problem with this implementation is that as the number of loads
increases, the memory required to hold the min and max values also increases
considerably. In most scenarios there is a single driver, and the memory
configured for the driver is small compared to the executors. With a
continuous increase in memory requirements, the driver can even run out of
memory, which makes the situation even worse.

*Proposed Solution to solve the above problem:*

CarbonData uses min and max values for blocklet-level pruning. A user does
not necessarily have filters on all the columns specified in the schema;
typically only a few columns have filters applied on them in a query.

1. We provide the user an option to cache the min and max values of only
the required columns. Caching only the required columns can optimize the
Blocklet DataMap memory usage as well as solve the driver memory problem to
a great extent (see the sketch after this list).

2. We use an external storage/DB to cache the min and max values. We can
implement a solution that creates a table in the external DB and stores the
min and max values for all the columns in that table. This will not use any
driver memory, and hence driver memory usage will be optimized even further
as compared to solution 1 (a hypothetical layout is sketched after the
comparison below).
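
To make solution 1 concrete, here is a minimal sketch of how the option
could be exposed as a table property. COLUMN_META_CACHE is only an
illustrative name, and a Carbon-enabled SparkSession is assumed; the final
syntax will be fixed in the design document.

// Illustrative sketch only: COLUMN_META_CACHE is a hypothetical property
// name for choosing which columns' min/max values the driver caches.
// Assumes a SparkSession with CarbonData support on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cached-columns-sketch")
  .getOrCreate()

spark.sql("""
  CREATE TABLE sales (
    order_id   BIGINT,
    country    STRING,
    amount     DOUBLE,
    order_date DATE
  )
  STORED BY 'carbondata'
  TBLPROPERTIES ('COLUMN_META_CACHE' = 'country,order_date')
""")

With such a property, only the country and order_date min/max values would
be held in the driver cache; filters on any other column would have to be
pruned outside the driver cache.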

*Solution 1* will not have any performance impact, as the user will cache
the required filter columns and there will be no external dependency for
query execution.
*Solution 2* will degrade query performance, as it will involve querying the
external DB for the min and max values required for blocklet pruning.
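
For solution 2, no external store has been chosen yet, so the following is
purely a hypothetical sketch of the table layout such a DB could use (H2 is
only a stand-in here, assuming its JDBC driver is on the classpath):

// Illustrative sketch only: the min/max table layout and the choice of DB
// are assumptions for this example, not part of the proposal itself.
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:h2:mem:minmax")
val stmt = conn.createStatement()

// One row per (table, segment, blocklet, column) holding its min and max.
stmt.execute("""
  CREATE TABLE blocklet_min_max (
    table_name  VARCHAR(128),
    segment_id  VARCHAR(32),
    blocklet_id VARCHAR(64),
    column_name VARCHAR(128),
    min_value   VARCHAR(512),
    max_value   VARCHAR(512))""")

// During pruning, only the filter column of the query is fetched, which is
// the extra round trip that makes this option slower than solution 1.
val rs = stmt.executeQuery(
  "SELECT blocklet_id, min_value, max_value FROM blocklet_min_max " +
  "WHERE table_name = 'sales' AND column_name = 'order_date'")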

*So, from my point of view, we should go with solution 1 now and propose a
design for solution 2 in the near future. The user can then have an option
to select between the two.* Kindly share your suggestions.

Regards
Manish Gupta

Re: [Discussion] Blocklet DataMap caching in driver

ravipesala
Hi Manish,
Thanks for proposing solutions to the driver memory problem.

+1 for solution 1, but it may not be the complete solution. We should also
have solution 2 to solve the driver memory issue completely. I think we
should have solution 2 in the very near future as well.

I have a few doubts and suggestions related to solution 1.
1. What if a query comes on non-cached columns? Will it start reading the
min/max values from disk on the driver side?
2. Are we planning to cache blocklet-level or block-level information on
the driver side for the cached columns?
3. What is the impact if we automatically choose the cached columns from
the user's queries instead of letting the user configure them?

Regards,
Ravindra.


--
Thanks & Regards,
Ravi

Re: [Discussion] Blocklet DataMap caching in driver

manishgupta88
Thanks Ravi for the feedback. I completely agree with you that we need to
develop the second solution ASAP. Please find my responses to your queries
below.

1. What if a query comes on non-cached columns? Will it start reading the
min/max values from disk on the driver side?
- If the query is on a non-cached column, then all the blocks will be
selected and min/max pruning will be done in each executor. There will not
be any read on the driver side: the driver is a single process, and reading
min/max values from disk for every query would increase the pruning time.
So I feel it is better to read in a distributed way using the executors.

2. Are we planning to cache blocklet-level or block-level information on
the driver side for the cached columns?
- We will provide an option for the user to cache at the Block or Blocklet
level. It will be configurable at the table level, and the default caching
will be at the Block level (see the sketch after these answers). I will
cover this part in detail in the design document.

3. What is the impact if we automatically choose the cached columns from
the user's queries instead of letting the user configure them?
- Every query can have different filter columns, so if we choose
automatically, then for every new filter column the min/max values will be
read from disk and loaded into the cache. This can be more cumbersome, and
query time can vary unexpectedly, which may not be justifiable. So I feel
it is better to let the user decide which columns should be cached.
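
As a rough illustration of the answer to point 2, the granularity could be
exposed as a table-level property as well. CACHE_LEVEL is a hypothetical
name here, used only to make the idea concrete:

// Illustrative sketch only: CACHE_LEVEL is a hypothetical table property.
// 'BLOCK' (the default) would cache block-level min/max: a smaller cache;
// 'BLOCKLET' would cache blocklet-level min/max: finer pruning, more memory.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-level-sketch")
  .getOrCreate()

spark.sql("ALTER TABLE sales SET TBLPROPERTIES ('CACHE_LEVEL' = 'BLOCKLET')")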

Let me know if you need any more clarification.

Regards
Manish Gupta




Re: [Discussion] Blocklet DataMap caching in driver

kanaka
Hi Manish,

Thanks for proposing configurable columns for the min/max cache. This will
help customers who have large data but use only a few columns in filter
conditions.
+1 for solution 1.


Regards,
Kanaka


Re: [Discussion] Blocklet DataMap caching in driver

Jacky Li
Hi Manish,

+1 for solution 1 for the next Carbon version. Solution 2 should also be considered, but for a version after the next one.

In my observation, in many scenarios users filter on a time range, and since a Carbon segment corresponds to one incremental load, it is normally related to time. So if we can have min/max for sort_columns at the segment level, I think it will further help minimize the driver index. Will you also consider this?

Regards,
Jacky



Re: [Discussion] Blocklet DataMap caching in driver

manishgupta88
Thanks for the feedback Jacky.

As of now we have min/max at the block and blocklet level, and while
loading the metadata cache we compute the task-level min/max. Segment-level
min/max is not considered as of now, but this solution can surely be
enhanced to consider it (a rough sketch follows).
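
A minimal sketch of what segment-level pruning on a sort column could look
like; SegmentMinMax and pruneSegments are hypothetical names, not existing
CarbonData APIs:

// Illustrative sketch only: none of these names exist in CarbonData today.
case class SegmentMinMax(segmentId: String, min: Long, max: Long)

// Keep only the segments whose [min, max] range overlaps the filter range,
// so whole segments are skipped before block/blocklet pruning even starts.
def pruneSegments(segments: Seq[SegmentMinMax],
                  filterMin: Long, filterMax: Long): Seq[String] =
  segments
    .filter(s => s.max >= filterMin && s.min <= filterMax)
    .map(_.segmentId)

// Incremental daily loads make segments time-ordered, so a time-range
// filter typically keeps only one or two segments.
val daily = Seq(
  SegmentMinMax("0", 20180601L, 20180607L),
  SegmentMinMax("1", 20180608L, 20180614L),
  SegmentMinMax("2", 20180615L, 20180621L))
println(pruneSegments(daily, 20180610L, 20180612L)) // prints List(1)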

We can discuss this further in detail and decide whether to consider it now
or enhance it in the near future.

Regards
Manish Gupta


Re: [Discussion] Blocklet DataMap caching in driver

manishgupta88
Hi Dev,

I have worked on the design document. Please find the link to it below and
share your feedback.

https://drive.google.com/open?id=1lN06Pj5tBiBIPSxOBIjK9bpbFVhlUoQA

I have also raised the JIRA issue and uploaded the design document there.
Please find the JIRA link below.

https://issues.apache.org/jira/browse/CARBONDATA-2638

Regards
Manish Gupta
