Hi all,
Currently Carbon supports a caching mechanism for blocks/blocklets. Even though it allows the end user to set the cache size, it is still very limited in functionality: the user chooses the carbon property *carbon.max.driver.lru.cache.size* arbitrarily, because before launching the carbon session he/she has no idea how much cache his/her workload requires. For this problem, I propose the following improvements to the carbon caching mechanism.

1. Support DDL for showing current cache used per table.

2. Support DDL for showing current cache used for a particular table.
   For these two points, QiangCai already has a PR:
   https://github.com/apache/carbondata/pull/3078

3. Support DDL for clearing all the entries in the cache. This will look like:
   CLEAN CACHE

4. Support DDL for clearing the cache for a particular table. This will clear all the entries in the cache which belong to that table. This will look like:
   CLEAN CACHE FOR TABLE tablename

5. Support DDL to estimate the cache required for a particular table. As explained above, the user does not know beforehand how much cache his/her current work will require, so this DDL will let the user estimate how much cache a particular table needs. For this we will launch a job, estimate the memory required for all the blocks, and sum it up.

6. Dynamic "max cache size" configuration. Suppose the user now knows the cache size he needs; the current system still requires him to set *carbon.max.driver.lru.cache.size* and restart the JDBC server for it to take effect. I suggest making the carbon property *carbon.max.driver.lru.cache.size* dynamically configurable, which allows the user to change the max LRU cache size on the fly.

Any suggestion from the community is greatly appreciated.

Thanks

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.
Hi Naman,
This will be very useful for users to control the cache size and its utilization. Please clarify the point below.

Should dynamic "max cache size" be supported? "carbon.max.driver.lru.cache.size" is a system-level configuration, whereas a dynamic property is a session-level property. We can support a dynamic SET only where the purpose of the property still holds good for the system; I think in this case it does not.

Thanks,
Dhatchayani

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Hi Naman
+1 for points 1, 2 and 6.
-1 for points 3, 4 and 5.

1. For points 1, 2: add a design doc mentioning everything that will be considered as cache when displaying the cache size.

2. For points 3, 4: I feel that cleaning the cache should be an internal thing and not exposed to the user. Exposing it might also mask bugs in the cache clean-up done at the time of dropping a table. You can think of cleaning stale cache through a separate thread which checks for stale entries at intervals, or you can try to integrate the functionality into the Clean DDL command.

3. For point 5: we should think of introducing a command to collect system statistics, something like Spark does, and calculate the memory requirements from there instead of exposing a DDL specifically for cache calculations.

Regards
Manish Gupta

On Tue, Feb 19, 2019 at 7:28 AM dhatchayani <[hidden email]> wrote:
+1 for Manish's advice.
+1 for 5, 6: after point 5 estimates the cache size, point 6 can modify the configuration dynamically.

+1 for 3, 4: maybe we need to add a lock to synchronize concurrent operations. If the cache can be released, the driver will not need to be restarted.

Maybe we also need to check how these operations would be used in "[DISCUSSION] Distributed Index Cache Server".

Best Regards
David Cai
Hi Naman,
Thanks for proposing this feature; it seems pretty interesting. A few points I want to bring up:

1. I think we require a detailed design for this feature in which all the DDLs you are going to expose are clearly specified, as frequent changes to DDLs are not recommended later. You should also cover the scenarios which can impact your DDL operations, like cross-session operations, e.g. one user is trying to clear the cache of a table while another user executes the show cache command. Basically, mention how you will handle all the synchronization scenarios.

2. Spark has already exposed DDLs for clearing caches, as below; please refer to them and try to get more insight. It is better to follow a standard syntax.
   CLEAR CACHE
   UNCACHE TABLE (IF EXISTS)? tableIdentifier

3. How will you deal with the drop table case? I think you should clear the respective cache as well; mention these scenarios clearly in your design document.

4. 0 for point 5, as I think you need to explain more in your design document about the scenarios and the need for this feature. This DDL can bring more complexity to the system, e.g. by the time the system calculates the table size, a new segment can be added or an existing segment can be modified, so again you need a lock so that these kinds of synchronization issues can be handled properly.

Overall I think the approach should be well documented before you start the implementation. Please let me know of any clarifications or suggestions regarding the above points.

Regards,
Sujith

On Mon, Feb 18, 2019 at 3:35 PM Naman Rastogi <[hidden email]> wrote:
Hi Naman,
Thanks for proposing the feature. It looks really helpful from both the user and developer perspective. We basically need the design document, so that all doubts can be cleared.

1. How are you going to handle sync issues, like multiple queries running concurrently with drop and show cache? Are you going to introduce any locking mechanism?

2. What if the user clears the cache during a query? How is it going to behave? Is it allowed, or is the concurrent operation blocked?

3. How is it going to work with the distributed index server and its variants (embedded, Presto and other local servers)? Basically, what is the impact there?

4. You said you will launch a job to get the size from all the blocks present. Currently we create the block or blocklet datamap, calculate each datamap's size, and then add it to the cache based on the configured LRU cache size. So I wanted to know how you will be calculating the size in your case.

Regards,
Akash
@dhatchayani
If we are to support estimating the cache size required for a particular table, we have to make the max cache size dynamic and allow changing it at runtime; otherwise the user has to restart the JDBC server all over again, which does not seem like a good idea from the end user's perspective. So there will be one default system property (CARBON.MAX.DRIVER.LRU.CACHE.SIZE) which is considered while starting the server, and later the user can change the cache size according to his need on the fly using a session property, e.g. CARBON.MAX.DRIVER.LRU.CACHE.SIZE.DYNAMIC (or some other name), with the set command:

set CARBON.MAX.DRIVER.LRU.CACHE.SIZE.DYNAMIC=100

Here 100 is in MB, similar to CARBON.MAX.DRIVER.LRU.CACHE.SIZE.

One restriction we could add is that the user, while changing the cache size, can only increase it and not decrease it, so that we do not have to remove entries which are already present in the cache. Whether we should add this restriction is also open to discussion.

This way, the user will be able to start the server with some value for the max cache size, which can be -1 as well, and then change the max cache size on the fly.

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.

On Tue, Feb 19, 2019 at 6:51 PM akashrn5 <[hidden email]> wrote:
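The increase-only restriction described above can be sketched as follows. This is a toy model in Python, not the CarbonData implementation; all names (LruCacheConfig, set_max_size) are hypothetical.

```python
class LruCacheConfig:
    """Toy model of a dynamically configurable max LRU cache size."""

    UNBOUNDED = -1  # the proposal allows starting the server with -1

    def __init__(self, max_size_mb):
        self.max_size_mb = max_size_mb

    def set_max_size(self, new_size_mb):
        # Increase-only restriction: shrinking the cache would force
        # eviction of entries already present, so reject it. Moving off
        # the initial -1 (unbounded) is always allowed, per the proposal.
        if (new_size_mb != self.UNBOUNDED
                and self.max_size_mb != self.UNBOUNDED
                and new_size_mb < self.max_size_mb):
            raise ValueError("max LRU cache size can only be increased at runtime")
        self.max_size_mb = new_size_mb


config = LruCacheConfig(LruCacheConfig.UNBOUNDED)  # server started with -1
config.set_max_size(100)    # first concrete size, analogous to SET ...=100
config.set_max_size(200)    # growing is allowed
try:
    config.set_max_size(50)  # shrinking is rejected
except ValueError as err:
    print(err)
```

Whether shrinking should instead trigger eviction, rather than be rejected, is exactly the open question in the message above.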
@akash
1. There will be no sync issues, as the operations will be atomic; only one operation is allowed on the LRU cache at a particular point in time (this will be taken care of in the code).

2. It will act as expected: a concurrent operation on the cache will be blocked.

3 & 4. We will do it the same way we do now: summing up the memory size of all the BlockDataMaps, i.e. summing up over all the DataMapRows. The only difference is that we will distribute that work among the executors.

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.

On Wed, Feb 20, 2019 at 7:51 PM Naman Rastogi <[hidden email]> wrote:
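The estimation in points 3 & 4 above — summing DataMapRow sizes per block, with the per-block work distributed among executors, then totalled on the driver — can be sketched like this. It is a Python toy model with hypothetical names, not the actual BlockDataMap code:

```python
def block_datamap_size_bytes(row_sizes_bytes):
    # One executor's share: total in-memory size of the DataMapRows
    # belonging to a single block's datamap.
    return sum(row_sizes_bytes)

def estimate_table_cache_bytes(blocks):
    # Driver side: add up the per-block partial sums returned by the
    # (hypothetical) distributed estimation job.
    return sum(block_datamap_size_bytes(rows) for rows in blocks)

# Two blocks whose datamap rows occupy 3 MB and 1 MB respectively.
mb = 1024 * 1024
blocks = [[2 * mb, mb], [mb]]
print(estimate_table_cache_bytes(blocks) / mb)  # 4.0
```

As Sujith noted earlier in the thread, segments can change while the job runs, so a total computed this way is a rough estimate rather than an exact figure.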
@manish
I looked at the code of CarbonLRUCache, and I don't think we have an actual LRU implementation for the cache. I am planning to make it an actual least-recently-used caching mechanism. If we make it an actual LRU cache, the problem of stale elements in the cache should be resolved, because they won't be accessed for some time and will ultimately be removed from the cache when other entries are added and the cache is full. The other option I could think of is a counter-based caching mechanism. Which should it be, LRU-based caching or counter-based caching?

OK, if DDLs do not look like a good idea for this, we can also use the CarbonCLI. What do you think about this?

@Sujith

1. Yes, as soon as the discussion reaches a conclusion on which DDLs to support and which to omit, I will share a design document.

3. Yes, drop table automatically clears the cache for the table.

4. Yes, that may happen, but the estimation is only meant to give the user a rough idea of how much memory the table will occupy in the cache. The user can then configure the cache size accordingly, with some slack.

Regards

Naman Rastogi
Technical Lead - BigData Kernel
Huawei Technologies India Pvt. Ltd.

On Tue, Feb 19, 2019 at 11:15 AM manish gupta <[hidden email]> wrote:
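The true least-recently-used behaviour described above — stale entries aging out on their own once the cache fills — can be sketched with an access-ordered map: every get moves the entry to the most-recent end, and eviction pops from the least-recent end. A minimal Python sketch, not the CarbonLRUCache code:

```python
from collections import OrderedDict

class LruCache:
    """Toy LRU cache: reads refresh recency; inserts evict the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry: stale entries that
            # are never read again age out here without any explicit
            # CLEAN CACHE command.
            self._entries.popitem(last=False)

    def keys(self):
        return list(self._entries)       # oldest first


cache = LruCache(2)
cache.put("t1", "datamap1")
cache.put("t2", "datamap2")
cache.get("t1")               # t1 becomes most recently used
cache.put("t3", "datamap3")   # evicts t2, the least recently used
print(cache.keys())           # ['t1', 't3']
```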
Hi Naman
I think with the current version's capabilities it is necessary to add some DDL support.

+1 for points 1, 2. For point 1 or 2, is it possible to output a list of information for all tables, using limit as a filter? In this way the user can see the whole picture of the cache and guide subsequent operations.

+1 for points 3, 4. I think points 3 and 4 are very useful for maintenance and tuning. As said above, concurrent operations need to be carefully considered; if there is a lock, it should be at the table level.

For points 5, 6, I think what Cai Qiang said is reasonable; we should check how they would be used under "[DISCUSSION] Distributed Index Cache Server".