Hi all,
Currently Carbon supports a caching mechanism for blocks/blocklets. Even though it allows the end user to set the cache size, it is still very limited in functionality: the user chooses the carbon property *carbon.max.driver.lru.cache.size* arbitrarily, because before launching the carbon session he/she has no idea how much cache his/her workload requires. For this problem, I propose the following improvements to the carbon caching mechanism.

1. Support DDL for showing current cache used per table.

2. Support DDL for showing current cache used for a particular table.
   For these two points, QiangCai already has a PR:
   https://github.com/apache/carbondata/pull/3078

3. Support DDL for clearing all the entries in the cache. This will look like:
   CLEAN CACHE

4. Support DDL for clearing the cache for a particular table. This will clear all the entries in the cache which belong to that table. This will look like:
   CLEAN CACHE FOR TABLE tablename

5. Support DDL to estimate the cache required for a particular table. As explained above, the user does not know beforehand how much cache his/her current work will require, so this DDL will let the user estimate how much cache a particular table needs. For this we will launch a job, estimate the memory required for all the blocks, and sum it up.

6. Dynamic "max cache size" configuration. Suppose the user now knows the cache size he needs; the current system still requires him to set *carbon.max.driver.lru.cache.size* and restart the JDBC server for it to take effect. I suggest making the carbon property *carbon.max.driver.lru.cache.size* dynamically configurable, which allows the user to change the max LRU cache size on the fly.

Any suggestion from the community is greatly appreciated.

Thanks

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.
Hi Naman,
This will be very useful for users to control the cache size and its utilization. Please clarify the point below.

Should dynamic "max cache size" be supported? "carbon.max.driver.lru.cache.size" is a system-level configuration, whereas a dynamic property is a session-level property. We can support a dynamic SET only where the purpose of the property still holds good for the system; I think in this case it does not.

Thanks,
Dhatchayani

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi Naman
+1 for points 1, 2 and 6.
-1 for points 3, 4 and 5.

1. For points 1, 2: add a design doc mentioning everything that will be considered as cache when displaying the cache size.

2. For points 3, 4: I feel that cleaning the cache should be an internal thing and not exposed to the user. Exposing it might also mask bugs in the cache clean-up done at the time of dropping a table. You can think of cleaning stale cache through a separate thread which checks for stale entries at intervals, or you can try to integrate the functionality into the Clean DDL command.

3. For point 5: we should think of introducing a command to collect system statistics, something like Spark does, and calculate the memory requirements from there instead of exposing a DDL specifically for cache calculations.

Regards
Manish Gupta

On Tue, Feb 19, 2019 at 7:28 AM dhatchayani <[hidden email]> wrote:
+1 for Manish's advice.
+1 for 5, 6: after point 5 estimates the cache size, point 6 can modify the configuration dynamically.

+1 for 3, 4: maybe we need to add a lock to synchronize concurrent operations. If the cache can be released, the driver will not need to be restarted.

Maybe we also need to check how these operations would be used in "[DISCUSSION] Distributed Index Cache Server".

Best Regards
David Cai
Hi Naman,
Thanks for proposing this feature; it seems pretty interesting. A few points I want to bring up:

1. I think we require a detailed design for this feature in which all the DDLs you are going to expose are clearly specified, as frequent changes to DDLs are not recommended later. You should also cover the scenarios which can impact your DDL operations, like cross-session operations, e.g. one user is trying to clear the cache of a table while another user executes the show cache command. Basically, mention how you will handle all the synchronization scenarios.

2. Spark has already exposed DDLs for clearing caches, as below; please refer to them and try to get more insight. It is better to follow a standard syntax.
   CLEAR CACHE
   UNCACHE TABLE (IF EXISTS)? tableIdentifier

3. How will you deal with the drop table case? I think you should clear the respective cache as well; mention these scenarios clearly in your design document.

4. 0 for point 5, as I think you need to explain more in your design document about the scenarios and the need for this feature. This DDL can bring more complexity to the system, e.g. by the time the system calculates the table size, a new segment can be added or an existing segment can be modified, so again you need a lock so that these kinds of synchronization issues can be handled properly.

Overall I think the approach should be well documented before you start the implementation. Please let me know of any clarifications or suggestions regarding the above points.

Regards,
Sujith

On Mon, Feb 18, 2019 at 3:35 PM Naman Rastogi <[hidden email]> wrote:
Hi Naman,
Thanks for proposing the feature. It looks really helpful from both the user and developer perspective. We basically need the design document, so that all doubts can be cleared.

1. How are you going to handle sync issues, like multiple queries running concurrently with drop and show cache? Are you going to introduce any locking mechanism?

2. What if the user clears the cache during a query? How is it going to behave? Is it allowed, or is the concurrent operation blocked?

3. How is it going to work with the distributed index server and its variants (embedded, Presto and other local servers)? Basically, what is the impact there?

4. You said you will launch a job to get the size from all the blocks present. Currently we create the block or blocklet datamap, calculate each datamap's size, and then add it to the cache based on the configured LRU cache size. So I wanted to know how you will be calculating the size in your case.

Regards,
Akash
@dhatchayani
If we are to support estimating the cache size required for a particular table, we have to make the max cache size dynamic and allow changing it at runtime; otherwise the user has to restart the JDBC server all over again, which does not seem like a good idea from the end user's perspective. So there will be one default system property (CARBON.MAX.DRIVER.LRU.CACHE.SIZE) which is considered while starting the server, and later the user can change the cache size according to his need on the fly using a session property, e.g. CARBON.MAX.DRIVER.LRU.CACHE.SIZE.DYNAMIC (or some other name), with the set command:

set CARBON.MAX.DRIVER.LRU.CACHE.SIZE.DYNAMIC=100

Here 100 is in MB, similar to CARBON.MAX.DRIVER.LRU.CACHE.SIZE.

One restriction we could add is that the user, while changing the cache size, can only increase it and not decrease it, so that we do not have to remove entries which are already present in the cache. Whether we should add this restriction is also open to discussion.

This way, the user will be able to start the server with some value for the max cache size, which can be -1 as well, and then change the max cache size on the fly.

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.

On Tue, Feb 19, 2019 at 6:51 PM akashrn5 <[hidden email]> wrote:
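The increase-only restriction described above can be sketched as follows. This is a toy model in Python, not the CarbonData implementation; all names (LruCacheConfig, set_max_size) are hypothetical.

```python
class LruCacheConfig:
    """Toy model of a dynamically configurable max LRU cache size."""

    UNBOUNDED = -1  # the proposal allows starting the server with -1

    def __init__(self, max_size_mb):
        self.max_size_mb = max_size_mb

    def set_max_size(self, new_size_mb):
        # Increase-only restriction: shrinking the cache would force
        # eviction of entries already present, so reject it. Moving off
        # the initial -1 (unbounded) is always allowed, per the proposal.
        if (new_size_mb != self.UNBOUNDED
                and self.max_size_mb != self.UNBOUNDED
                and new_size_mb < self.max_size_mb):
            raise ValueError("max LRU cache size can only be increased at runtime")
        self.max_size_mb = new_size_mb


config = LruCacheConfig(LruCacheConfig.UNBOUNDED)  # server started with -1
config.set_max_size(100)    # first concrete size, analogous to SET ...=100
config.set_max_size(200)    # growing is allowed
try:
    config.set_max_size(50)  # shrinking is rejected
except ValueError as err:
    print(err)
```

Whether shrinking should instead trigger eviction, rather than be rejected, is exactly the open question in the message above.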
@akash
1. There will be no sync issues, as the operations will be atomic; only one operation is allowed on the LRU cache at a particular point in time (this will be taken care of in the code).

2. It will act as expected: a concurrent operation on the cache will be blocked.

3 & 4. We will do it the same way we do now: summing up the memory size of all the BlockDataMaps, i.e. summing up over all the DataMapRows. The only difference is that we will distribute that work among the executors.

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.

On Wed, Feb 20, 2019 at 7:51 PM Naman Rastogi <[hidden email]> wrote:
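The estimation in points 3 & 4 above — summing DataMapRow sizes per block, with the per-block work distributed among executors, then totalled on the driver — can be sketched like this. It is a Python toy model with hypothetical names, not the actual BlockDataMap code:

```python
def block_datamap_size_bytes(row_sizes_bytes):
    # One executor's share: total in-memory size of the DataMapRows
    # belonging to a single block's datamap.
    return sum(row_sizes_bytes)

def estimate_table_cache_bytes(blocks):
    # Driver side: add up the per-block partial sums returned by the
    # (hypothetical) distributed estimation job.
    return sum(block_datamap_size_bytes(rows) for rows in blocks)

# Two blocks whose datamap rows occupy 3 MB and 1 MB respectively.
mb = 1024 * 1024
blocks = [[2 * mb, mb], [mb]]
print(estimate_table_cache_bytes(blocks) / mb)  # 4.0
```

As Sujith noted earlier in the thread, segments can change while the job runs, so a total computed this way is a rough estimate rather than an exact figure.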
@manish
I looked at the code of CarbonLRUCache, and I don't think we have an actual LRU implementation for the cache. I am planning to make it an actual least-recently-used caching mechanism. If we make it an actual LRU cache, the problem of stale elements in the cache should be resolved, because they won't be accessed for some time and will ultimately be removed from the cache when other entries are added and the cache is full. The other option I could think of is a counter-based caching mechanism. Which should it be, LRU-based caching or counter-based caching?

OK, if DDLs do not look like a good idea for this, we can also use the CarbonCLI. What do you think about this?

@Sujith

1. Yes, as soon as the discussion reaches a conclusion on which DDLs to support and which to omit, I will share a design document.

3. Yes, drop table automatically clears the cache for the table.

4. Yes, that may happen, but the estimation is only meant to give the user a rough idea of how much memory the table will occupy in the cache. The user can then configure the cache size accordingly, with some slack.

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.

On Tue, Feb 19, 2019 at 11:15 AM manish gupta <[hidden email]> wrote:
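The true least-recently-used behaviour described above — stale entries aging out on their own once the cache fills — can be sketched with an access-ordered map: every get moves the entry to the most-recent end, and eviction pops from the least-recent end. A minimal Python sketch, not the CarbonLRUCache code:

```python
from collections import OrderedDict

class LruCache:
    """Toy LRU cache: reads refresh recency; inserts evict the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry: stale entries that
            # are never read again age out here without any explicit
            # CLEAN CACHE command.
            self._entries.popitem(last=False)

    def keys(self):
        return list(self._entries)       # oldest first


cache = LruCache(2)
cache.put("t1", "datamap1")
cache.put("t2", "datamap2")
cache.get("t1")               # t1 becomes most recently used
cache.put("t3", "datamap3")   # evicts t2, the least recently used
print(cache.keys())           # ['t1', 't3']
```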
Hi Naman
I think with the current version's capabilities it is necessary to add some DDL support.

+1 for points 1, 2. For point 1 or 2, is it possible to output a list of information for all tables, using limit as a filter? In this way the user can see the whole picture of the cache and guide subsequent operations.

+1 for points 3, 4. I think points 3 and 4 are very useful for maintenance and tuning. As said above, concurrent operations need to be carefully considered; if there is a lock, it should be at the table level.

For points 5, 6, I think what Cai Qiang said is reasonable; we should check how they would be used under "[DISCUSSION] Distributed Index Cache Server".