[DISCUSSION] Cache Pre Priming


[DISCUSSION] Cache Pre Priming

akashnilugal@gmail.com
Hi Community,

Currently, we have an index server, which provides distributed caching of the datamaps in a separate Spark application.

The datamaps are cached in the index server only when a query is fired on the table for the first time. All the datamaps are loaded if a count(*) is fired, and only the required ones are loaded for a filter query.

The bottleneck is that until a query is fired on the table, no caching is done for its datamaps. Consider a scenario where we only load data into the table for a whole day and query it the next day: all the segments start loading into the cache at that point, so the first query is slow.

What if we load the datamaps into the cache, that is, pre-prime the cache, without waiting for any query on the table? We could load the cache after every data load completes, or load the cache for all the segments at once, so that the first query does not have to do this work and runs faster.
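
To make the difference concrete, here is a minimal toy model in Python. It is only a sketch: `IndexCache`, `pre_prime`, `query` and the segment ids are hypothetical names for illustration, not CarbonData or index server APIs.

```python
# Toy model only: every name here is hypothetical, not a CarbonData API.
class IndexCache:
    def __init__(self):
        self.entries = {}        # segment id -> loaded index (datamap)
        self.loads_on_query = 0  # index loads that a query had to pay for

    def _load(self, segment_id):
        # Stand-in for reading a segment's index files from storage.
        return f"index-of-{segment_id}"

    def pre_prime(self, segment_id):
        """Called right after a data load finishes; no query is involved."""
        if segment_id not in self.entries:
            self.entries[segment_id] = self._load(segment_id)

    def query(self, segment_ids):
        """Use cached indexes, loading any that are missing."""
        for seg in segment_ids:
            if seg not in self.entries:
                self.entries[seg] = self._load(seg)
                self.loads_on_query += 1
        return [self.entries[seg] for seg in segment_ids]


segments = ["0", "1", "2"]

lazy = IndexCache()
lazy.query(segments)       # the first query loads all three indexes itself

primed = IndexCache()
for seg in segments:
    primed.pre_prime(seg)  # done at data-load time
primed.query(segments)     # the first query finds everything cached
```

In this model the lazy cache performs three index loads inside the first query, while the pre-primed cache performs none.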


I have attached the design document for pre-priming the cache in the index server. Please have a look at it;

any suggestions or inputs on this are most welcome.


https://drive.google.com/file/d/1YUpDUv7ZPUyZQQYwQYcQK2t2aBQH18PB/view?usp=sharing



Regards,

Akash R Nilugal


Re: [DISCUSSION] Cache Pre Priming

manhua
Hi Akash,
Could you please raise a JIRA and attach the design doc? I cannot access the link.


Thanks


Regards
Manhua

Re: [DISCUSSION] Cache Pre Priming

akashnilugal@gmail.com
In reply to this post by akashnilugal@gmail.com
Hi All,

I have raised a JIRA and attached the design doc there. Please refer to CARBONDATA-3492.

Regards,
Akash


Re: [DISCUSSION] Cache Pre Priming

xuchuanyin
In reply to this post by akashnilugal@gmail.com
Hi, I have two questions about the current index server implementation:

1. Currently, while running a filter query, do we need to load the index
data of all segments into the cache server, or only that of the segments
required by the query?

2. At what point during the query is the cache loading triggered?

As for the proposal in this mail, what will happen if auto-compaction occurs
for this load?

3. Since we want to preload the index into the cache, we may need to handle
all the scenarios that cause data ingestion. It seems the SDK scenario has
been missed.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Cache Pre Priming

Manhua-2
In reply to this post by akashnilugal@gmail.com
Hi, I have come up with the following ideas:

1. Although the index server can provide more memory to hold the cache for index data, its space is still limited.

So attention should be paid to cache management (especially cache invalidation) if we pre-prime during data load or at index server startup, since that can easily fill up the index server's memory over time.

2. Pre-priming is an extended optimization, and it should focus on what we actually want to optimize.

So, for choosing what to pre-prime, I think the configuration can support a regex/wildcard match list:

- During index server startup, check and pre-prime matching EXISTING tables;
- During data load, check and pre-prime matching NEW tables or NEW segments.

This can lighten the workload and keep the targeted tables cached, in case they would otherwise be swapped out when many indexes are loaded into the cache.

3. A cache command could be another way to pre-prime manually, for testing or for embedding in code.




Re: [DISCUSSION] Cache Pre Priming

akashnilugal@gmail.com
In reply to this post by akashnilugal@gmail.com
Hi Manhua,

Thanks for the inputs.

1. There is no need to handle cache invalidation separately. I agree that
the cache has a limit, but since we already have an eviction policy, when
the next query comes it will evict entries as required and load the
segments it needs. So it is better not to have a separate mechanism to
invalidate the cache during pre-prime.

2.
i. For configuration support of pre-prime, we can already specify the
database name or table name. We will note the regex support, and I will
update the design document based on other use cases and impacts.
ii. During load, there is no need to load the whole table or read any
configuration for pre-prime. Pre-prime during load just takes the current
new segment and loads it into the cache.

3. For command support, can you please explain with more use cases? Index
server startup will already load the cache, and a count(*) will also load
all the segments, so I think a new command won't be necessary.

Please get back to me for any clarifications or doubts.

Thanks

Regards,
Akash R Nilugal


Re: [DISCUSSION] Cache Pre Priming

akashrn5
In reply to this post by xuchuanyin
Hi xuchuanyin,

Thanks for the questions.

1. The current implementation does not need to load all the segments. Only
the required segments are loaded during a filter query, and all segments
are loaded during a query like count(*).

2. Cache loading is triggered during the pruning phase of the query. The
query goes to the index server, which prunes and loads the cache. If the
index server is disabled and distributed pruning is enabled, distributed
pruning happens; otherwise driver-side pruning happens. Please check the
index server design doc for more info on this.

For auto-compaction, there is no need to load into the index server,
because internally one more level of compaction can happen and the
previously loaded segments can become invalid. I will handle this in the
design document.

3. The index server is a separate Spark application meant for caching. For
the SDK, the Spark session doesn't come into the picture, so this is not
applicable to the SDK. We will handle the file format case.
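
The fallback order described in point 2 can be sketched as a small decision function. This is only an illustration of that order: `PruningPath`, `choose_pruning_path` and the two boolean flags are hypothetical names, not actual CarbonData classes or properties.

```python
from enum import Enum


class PruningPath(Enum):
    INDEX_SERVER = "index server"
    DISTRIBUTED = "distributed pruning"
    DRIVER = "driver-side pruning"


def choose_pruning_path(index_server_enabled: bool,
                        distributed_pruning_enabled: bool) -> PruningPath:
    """Pick where pruning (and therefore cache loading) happens for a query."""
    if index_server_enabled:
        # The index server prunes and loads its own cache.
        return PruningPath.INDEX_SERVER
    if distributed_pruning_enabled:
        return PruningPath.DISTRIBUTED
    # Neither is enabled: fall back to pruning in the driver.
    return PruningPath.DRIVER
```

Only the first branch touches the index server cache, which is why pre-priming targets that path.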


Please get back to me for any clarifications or inputs.

Thanks and Regards

Akash R Nilugal




Re: [DISCUSSION] Cache Pre Priming

Manhua-2
In reply to this post by akashnilugal@gmail.com
Hi Akash,

1. The cache can become full while loading is still running. The reason I mentioned invalidation is to avoid one case in particular: the cache filling up before all the targeted indexes are loaded.

When the server is just starting, it is not good to keep pre-priming and swap out the earliest loaded indexes.
Maybe pre-prime should check the available cache capacity before loading an index, and stop pre-priming if there is not enough room?

2. I think regex/wildcard is more flexible to use,
such as:
*.* for all dbs and tables
test.* for all tables in the test db
test.day_table_201908* for tables with the targeted prefix

3. Yes, you are right, firing a count(*) can do that.
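
As an illustration of the wildcard patterns in point 2, here is a minimal Python sketch using the standard library's `fnmatch` module. The function `should_pre_prime` and the way the pattern list is passed in are assumptions for this example; the thread does not define how such a configuration would be read.

```python
from fnmatch import fnmatchcase


def should_pre_prime(db: str, table: str, patterns: list[str]) -> bool:
    """Return True if 'db.table' matches any configured wildcard pattern."""
    qualified = f"{db}.{table}"
    # fnmatchcase does shell-style, case-sensitive matching; '*' matches any run
    # of characters, including '.', so patterns should be written as 'db.table'.
    return any(fnmatchcase(qualified, pattern) for pattern in patterns)


patterns = ["test.day_table_201908*"]
print(should_pre_prime("test", "day_table_20190815", patterns))
print(should_pre_prime("test", "day_table_20190901", patterns))
```

With the single pattern above, the August table matches and the September table does not. Note that `fnmatch` also treats `?` and `[...]` as wildcards, which may or may not be wanted in a real configuration.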



Re: [DISCUSSION] Cache Pre Priming

akashrn5


Hi Manhua,

1. You are right that the cache will become full at some point. If we stop pre-priming as you suggest, a later query will still try to load into the cache, and if there is not enough space it will evict entries and then load.
Pre-prime would do the same thing, so LRU will handle that for us. I will still think about this and let you know, and if feasible I will update the design.

Maybe we can stop pre-priming once the cache is full. I'll update this once finalised.


2. Wildcard support is also fine as per your input. In the initial stage, load and pre-prime come first, and we can provide regex support after that.
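
The two behaviours discussed in point 1 (queries evict via LRU, while pre-prime could simply stop when the cache is full) can be contrasted with a small Python sketch. `LruIndexCache`, its methods and the sizes are hypothetical and only model the idea; they are not the index server's actual cache implementation.

```python
from collections import OrderedDict


class LruIndexCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # segment id -> size, least recently used first

    def used(self) -> int:
        return sum(self.entries.values())

    def put(self, segment_id: str, size: int) -> None:
        """Query path: evict least recently used entries until the new one fits."""
        if size > self.capacity:
            return  # can never fit; do not empty the cache for it
        self.entries.pop(segment_id, None)
        while self.entries and self.used() + size > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry
        self.entries[segment_id] = size

    def pre_prime(self, segments) -> list:
        """Pre-prime path: never evict; stop at the first segment that does not fit."""
        loaded = []
        for segment_id, size in segments:
            if segment_id in self.entries:
                continue
            if self.used() + size > self.capacity:
                break
            self.entries[segment_id] = size
            loaded.append(segment_id)
        return loaded


cache = LruIndexCache(capacity=100)
loaded = cache.pre_prime([("0", 40), ("1", 40), ("2", 40)])  # "2" does not fit
cache.put("2", 40)  # a later query needs segment "2" and evicts segment "0"
```

Stopping pre-prime at capacity keeps the earliest pre-primed segments in place, and a later query still gets what it needs through normal eviction.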

Thank you for the suggestion



Re: [DISCUSSION] Cache Pre Priming

litao
In reply to this post by akashnilugal@gmail.com
Hi Akash,
I have a few questions.
1. About the ways to pre-prime: there are two ways. One is to cache during data loading; the other is to cache when the cache server starts.
   I think the latter is not desirable, because loading the cache may take a long time and can leave the cache server unresponsive for that period. The first way needs some data to support it. Caching the index data may make data loading take longer. Although separate threads are started, there will still be a lot of I/O and computing overhead, which may slow down data loading. So the first way needs some detailed numbers: how big are the index files? How much impact does it have on loading?
   Should we provide a third way, triggered through an interface? User-triggered cache loading could be provided, so that users can choose a time when the system is idle, such as late at night.
2. About configuration
   Could you please give an example of the use of carbon.index.server.pre.prime?
3. About datamap table loading or child table loading to cache
   I think this point is very important and needs a more detailed description. For example, when an update or delete happens, how does the cache change? When an MV is dropped or created, how does the cache change?
4. About the rebuild command
   What do we need to do when this command is used: first clear the cached data, then load the cache again? Can this command be executed many times?
5. About compaction
   As with rebuild above, do we need to decide which cache entries should be cleared and which other segments' cache needs to be loaded?
On 2019/08/15 12:03:09, Akash Nilugal <[hidden email]> wrote:

> Hi Community,
>
> Currently, we have an index server which basically helps in distributed
> caching of the datamaps in a separate spark application.
>
> The caching of the datamaps in index server will start once the query is
> fired on the table for the first time, all the datamaps will be loaded
>
> if the count(*) is fired and only required will be loaded for any filter
> query.
>
>
> Here the problem or the bottleneck is, until and unless the query is fired
> on table, the caching won’t be done for the table datamaps.
>
> So consider a scenario where we are just loading the data to table for
> whole day and then next day we query,
>
> so all the segments will start loading into cache. So first time the query
> will be slow.
>
>
> What if we load the datamaps into cache or preprime the cache without
> waititng for any query on the table?
>
> Yes, what if we load the cache after every load is done, what if we load
> the cache for all the segments at once,
>
> so that first time query need not do all this job, which makes it faster.
>
>
> Here i have attached the design document for the pre-priming of cache into
> index server. Please have a look at it
>
> and any suggestions or inputs on this are most welcomed.
>
>
> https://drive.google.com/file/d/1YUpDUv7ZPUyZQQYwQYcQK2t2aBQH18PB/view?usp=sharing
>
>
>
> Regards,
>
> Akash R Nilugal
>

Re: [DISCUSSION] Cache Pre Priming

litao
In reply to this post by akashnilugal@gmail.com
Hi Akash,
      Before development starts, we need to know how much queries can improve when part of the index is cached in advance.
      We should compare the first and second runs of a query and measure the time difference in each of the important steps.
      That analysis will show how much performance improvement caching part of the index in advance can bring.
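
For illustration, a minimal sketch of such a per-step comparison between a cold (first) and a warm (second) query run. Everything here is an assumption made for the example: the `StepTimer` helper, the step names `load_index` and `prune`, and the `sleep()` calls that stand in for real work. It is not CarbonData code, only one way the measurement could be organised.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collects wall-clock durations for named steps of one query run."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so a step that runs several times is summed.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed


def compare_runs(first, second):
    """Return {step: first_seconds - second_seconds} for every step seen."""
    steps = sorted(set(first.durations) | set(second.durations))
    return {
        s: first.durations.get(s, 0.0) - second.durations.get(s, 0.0)
        for s in steps
    }


def run_query(timer, cache):
    """Stand-in for a query: hypothetical steps, simulated with sleep()."""
    with timer.step("load_index"):
        if "index" not in cache:  # cold cache: pay the load cost once
            time.sleep(0.05)
            cache["index"] = True
    with timer.step("prune"):
        time.sleep(0.01)


cache = {}
first, second = StepTimer(), StepTimer()
run_query(first, cache)   # cold run
run_query(second, cache)  # warm run
diff = compare_runs(first, second)
for name, saved in diff.items():
    print(f"{name}: first run slower by {saved:.3f}s")
```

A large positive difference for a step means that step is where the first run loses its time, which is the part pre-priming could remove.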


Re: [DISCUSSION] Cache Pre Priming

akashnilugal@gmail.com
In reply to this post by litao


On 2019/08/21 02:39:45, tao li <[hidden email]> wrote:

>   Hi Akash,
>     I have a few questions.
>     1. About the ways to pre-prime: there are two ways. One is to cache during data loading; the other is to cache when the cache server starts.
>         I think the latter is not desirable, because loading the cache may take a long time and can leave the cache server unresponsive for that period. The first way needs some supporting data. Caching the index data may take more time than the data loading itself. Although separate threads are started, there will still be a lot of IO and computing overhead, which may slow down data loading. So the first way needs some detailed figures: how big is the index file, and how much impact does it have on loading?
>         Should we provide a third way, an interface trigger? User-triggered cache loading could be provided, so that users can pick a time when the system is free, such as late at night.
>      2. About Configuration
>      Could you please give an example of the use of carbon.index.server.pre.prime?
>      3. About Datamap Table Loading or Child Table Loading to Cache
>       I think this point is very important and needs a more detailed description. For example, how does the cache change when an update or delete happens? How does the cache change when an MV is dropped or a new one is created?
>       4. About the Rebuild Command
>       What do we need to do when this command is used: first clear the cached data, then load the cache again? Can this command be executed many times?
>       5. About Compaction
>       As with rebuild, do we need to decide which cache entries should be cleared and which segments' cache needs to be loaded?

Hi Litao,

1. I don't think I fully understood your point about the first way. We load into the cache only the segment created by that load, not all the segments, so it should not affect load performance much. The second way is driven by configuration: the cache is loaded only if it is configured; otherwise you can leave it for the query to take care of.

You suggested a third way, a user-triggered interface to run at night or during low-traffic hours. That is the same as running count(*) at night, so there is no need to expose an extra operation for it.

2. About the configuration: you set the value of this property in the way described in the main mail chain, and the cache is loaded based on that value.

3. About this point, I have updated the design document with more description. Please refer to the JIRA for it and get back to me if anything needs clarification.

4. The rebuild command is only useful for building a lazy MV datamap. Currently only MV can be either lazy or non-lazy; all the other datamaps are non-lazy. So whenever rebuild is called and the MV is not in sync with the main table segments, it loads that data into the MV and then loads the new MV segment into the cache.

5. As already explained in the design document, once compaction is done, we will invalidate the compacted segments in the cache and load the new segment into the cache.
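
Points 1 and 5 above can be pictured with a small toy model. This is only a sketch under stated assumptions: the class `SegmentIndexCache`, its method names, the table name `sales` and the segment ids are all hypothetical and chosen for the example; the real behaviour is the one described in the design document.

```python
class SegmentIndexCache:
    """Toy model of a per-table cache keyed by segment id (hypothetical)."""

    def __init__(self, load_index):
        self._load_index = load_index  # callable: (table, segment_id) -> index
        self._cache = {}               # (table, segment_id) -> index

    def prime_segment(self, table, segment_id):
        """Called after a data load: cache only the newly loaded segment."""
        self._cache[(table, segment_id)] = self._load_index(table, segment_id)

    def on_compaction(self, table, compacted_ids, new_segment_id):
        """Drop the segments that were merged, then cache the merged one."""
        for seg in compacted_ids:
            self._cache.pop((table, seg), None)
        self.prime_segment(table, new_segment_id)

    def cached_segments(self, table):
        return sorted(seg for (t, seg) in self._cache if t == table)


# The loader is a stand-in that just returns a label for the index.
cache = SegmentIndexCache(lambda table, seg: f"index-of-{table}-{seg}")
for seg in ["0", "1", "2", "3"]:
    cache.prime_segment("sales", seg)  # one prime per data load
before = cache.cached_segments("sales")
cache.on_compaction("sales", ["0", "1", "2", "3"], "0.1")
after = cache.cached_segments("sales")
```

Each load touches a single cache entry, and compaction replaces the merged entries with one entry for the new segment, so the cache never holds both the old and the compacted segments.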

Please get back to me with any clarifications or inputs.

Thanks,

Akash R

Re: [DISCUSSION] Cache Pre Priming

akashnilugal@gmail.com
In reply to this post by litao
Hi Litao,

Initially, the first count(*) took around 32 seconds because it had to load the datamaps into the cache, and the second run of the query took about 1.5 to 2 seconds, I think. So with pre-priming we can achieve a large improvement in first-time query performance.

Regards,
Akash

On 2019/08/21 03:03:55, tao li <[hidden email]> wrote:

> Hi Akash,
>       Before development starts, we need to know how much queries can improve when part of the index is cached in advance.
>       We should compare the first and second runs of a query and measure the time difference in each of the important steps.
>       That analysis will show how much performance improvement caching part of the index in advance can bring.

Re: [DISCUSSION] Cache Pre Priming

litao
Hi Akash,
    How much of the performance difference between the first and second queries comes from caching the index, and how much from Hadoop caching?
    We should break it down and look at where the time goes on the driver side.

On 2019/08/21 09:42:10, Akash Nilugal <[hidden email]> wrote:

> Hi Litao,
>
> Initially, the first count(*) took around 32 seconds because it had to load the datamaps into the cache, and the second run of the query took about 1.5 to 2 seconds, I think. So with pre-priming we can achieve a large improvement in first-time query performance.
>
> Regards,
> Akash

Re: [DISCUSSION] Cache Pre Priming

litao
In reply to this post by akashnilugal@gmail.com
Hi Akash,
    count(*) can load only one table. If there are many tables, it would be better to have a command that triggers the cache load.


Re: [DISCUSSION] Cache Pre Priming

akashnilugal@gmail.com
In reply to this post by litao
Hi litao,

Basically, if the first-time query takes x amount of time in total, and y of that is spent connecting to the index server, caching, and returning, then with pre-priming we can save this y. If not all the segments have been loaded, we save less than y. Either way we get a benefit; we can benchmark it later.
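
As a rough worked example of the x and y reasoning above: the function below is a sketch, not a measurement. The linear model (saving proportional to the share of segments already primed) and the value y = 30 are assumptions made for illustration; only the 32-second first-query figure comes from this thread.

```python
def first_query_estimate(x, y, primed_fraction):
    """Estimate first-query time when part of the cache is already primed.

    x: total time of the first query with a cold cache (seconds)
    y: part of x spent connecting to the index server, caching and returning
    primed_fraction: share of segments already in cache, from 0.0 to 1.0
    Assumes, for illustration only, that the saving is proportional to the
    primed share.
    """
    if not 0.0 <= primed_fraction <= 1.0:
        raise ValueError("primed_fraction must be between 0 and 1")
    if y > x:
        raise ValueError("y cannot exceed x")
    saved = y * primed_fraction
    return x - saved


# x = 32 s is the cold count(*) time quoted in the thread; y = 30 s is assumed.
full = first_query_estimate(x=32.0, y=30.0, primed_fraction=1.0)
half = first_query_estimate(x=32.0, y=30.0, primed_fraction=0.5)
print(full, half)
```

With every segment primed the estimate drops to x - y; with only half primed, the saving is half of y under this model.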

As for data loading time, since we load into the cache asynchronously, it won't affect loading.

I didn't understand what you said about the Hadoop cache. Can you please elaborate on what exactly you mean by it?

About the command to load all tables: I will consider the feasibility and then maybe include it in the design and implementation.
I will create JIRA sub-tasks for loading into the cache after data load, the configuration-based load, and the command. Then we can take up the tasks based on priority.

Regards,
Akash

On 2019/08/21 10:43:39, tao li <[hidden email]> wrote:

> Hi Akash,
>     How much of the performance difference between the first and second queries comes from caching the index, and how much from Hadoop caching?
>     We should break it down and look at where the time goes on the driver side.

Re: [DISCUSSION] Cache Pre Priming

chetdb
In reply to this post by akashnilugal@gmail.com
Hi Akash,

1. Will the performance of the end-to-end data load operation be impacted if the segment datamap is loaded into the cache once the load is finished?
2. Will there be a notification in the logs stating that loading of the datamap cache has completed?

Regards


Re: [DISCUSSION] Cache Pre Priming

akashnilugal@gmail.com
Hi chetan,

As mentioned in the design, loading into the cache will be an async operation, and we will load only the corresponding segment into the cache, so there won't be any performance hit.
Logs will be added.

On 2019/08/21 13:18:05, chetan bhat <[hidden email]> wrote:

> Hi Akash,
>
> 1. Will the performance of the end-to-end data load operation be impacted if the segment datamap is loaded into the cache once the load is finished?
> 2. Will there be a notification in the logs stating that loading of the datamap cache has completed?
>
> Regards