Apache CarbonData Dev Mailing List archive - Re: [Discussion] Minimize the Btree size and unify the driver and executor Btrees.

Apache CarbonData Dev Mailing List archive

Re: [Discussion] Minimize the Btree size and unify the driver and executor Btrees.

Posted by kumarvishal09 on May 18, 2017; 7:31am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Minimize-the-Btree-size-and-unify-the-driver-and-executor-Btrees-tp12771p12865.html

Hi Ravi,

I have few queries related to both the solution.

1. To reduce the java heap for Btree , we can remove the Btree data
structure and use simple single array and do binary search on it. And also
we should move this cached arrays to unsafe (offheap/onheap) to reduce the
burden on GC.

Single array of object?? If user is loading small data in each load so we
will save only intermediate node of btree by maintaining in array. Yes by
maintaining all the metadata(startkey, min,max) in offheap will reduce
memory footprint. Is it possible to maintain data in single byte array and
maintaining the offset of each key for each column??

2. Unify the btree to single Btree instead of 2 and load at driver side.
So that only one lookup can be done to find the blocklets directly. And
executors are not required to load the btree for every query.

If we unify the btree to single btree, then it will be loaded in driver
side, so driver will need more memory to maintain this metadata and for
each query driver needs to do blocklet level pruning, in case of concurrent
query driver may be overloaded. And how we will send this metadata
information (offset of each column in file and other metadata) to executor,
serializing and deserializing this metadata cost also we need to consider.

Please correct me if my comments are not valid.

-Regards
Kumar Vishal

On Thu, May 18, 2017 at 12:33 PM, Ravindra Pesala <[hidden email]>
wrote:

> Hi Liang,
>
> Yes Liang , it will be done in 2 parts. At first reduce the size of the
> btree and then merge the driver side and executor btree to single btree.
>
> Regards,
> Ravindra.
>
> On 17 May 2017 at 19:28, Liang Chen <[hidden email]> wrote:
>
> > Hi Ravi
> >
> > Thank you bringing this improvement discussion to mailing list.
> >
> > One question , the point1 how to solve the below issues ? there are still
> > two part index info in driver and executor side ?
> > ------------------------------------------------------------
> > ----------------------------------------
> > And also chances of loading btree on each executor is more for every
> query
> > because there is no guarantee that same block goes to same executor every
> > time. It will be worse in case of dynamic containers.
> >
> > Regards
> > Liang
> >
> > 2017-05-17 7:33 GMT-04:00 Ravindra Pesala <[hidden email]>:
> >
> > > Hi,
> > >
> > > *1. Current problem.*
> > > 1.There is more size taking on java heap to create Btree for index
> file.
> > > It is because we create multiple objects for each leaf node so it takes
> > > more memory inside heap than actual size of index file. while doing LRU
> > > cache also we are considering only index file size instead of objects
> > size
> > > so it impacts the eviction process of LRU cache.
> > > 2. Currently we load one btree on driver side to find the blocks and
> > load
> > > another btree on executor side to find the blocklets. After we have
> > > increased the blocklet size to 128 mb and decrease the table_block size
> > to
> > > 256 mb the number of nodes inside driver side btree and executor side
> > btree
> > > is not much different. So it would be overhead to read the same
> > information
> > > twice.
> > > And also chances of loading btree on each executor is more for every
> > query
> > > because there is no guarantee that same block goes to same executor
> every
> > > time. It will be worse in case of dynamic containers.
> > >
> > > *2. Proposed solution.*
> > > 1. To reduce the java heap for Btree , we can remove the Btree data
> > > structure and use simple single array and do binary search on it. And
> > also
> > > we should move this cached arrays to unsafe (offheap/onheap) to reduce
> > the
> > > burden on GC.
> > > 2. Unify the btree to single Btree instead of 2 and load at driver
> side.
> > > So that only one lookup can be done to find the blocklets directly. And
> > > executors are not required to load the btree for every query.
> > > We can consider moving this to separate metadata service eventually
> > > once the memory footprint get reduced.
> > >
> > > First I will consider point 1 reduce the btree size after that I
> consider
> > > merging of btrees.
> > >
> > > Please comment on it.
> > >
> > > --
> > > Thanks & Regards,
> > > Ravindra.
> > >
> >
> >
> >
> > --
> > Regards
> > Liang
> >
>
>
>
> --
> Thanks & Regards,
> Ravi
>

kumar vishal