Apache CarbonData Dev Mailing List archive

Re: [DISCUSSION] Distributed Index Cache Server

Posted by kunalkapoor on Feb 13, 2019; 9:11am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Distributed-Index-Cache-Server-tp75008p75109.html

Hi Xuchuanyin,
Thank you for the suggestion/questions.

1. You are right: what matters is the type of pruning, not the type of
datamap, because we will support all types of datamaps. The bloom datamap
line was just an example to illustrate that we already use distributed
datamap pruning for bloom. I will reword that part of the design document.

2.1 We want the index server to run in a different cluster so that it is
centralised.

2.2 We had considered using an in-memory DB, but the same problems would
occur with a huge split load (1 million or more). Other solutions such as
Elasticsearch would be much faster, but the implementation would have to be
done from scratch. For now we are starting with a less error-prone approach,
because the existing pruning logic only has to be moved from the driver to
the executors; no new logic is being introduced. We can certainly integrate
other solutions in the future.
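
As a rough illustration of the point in 2.2, the sketch below shows the same
pruning function being used unchanged in a single process (the driver) and
across several workers. This is not CarbonData code: BlockIndex, prune,
driver_side_prune and distributed_prune are hypothetical names, min/max
pruning is assumed only as a simple example of an index, and threads stand
in for executors.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockIndex:
    """Hypothetical min/max index entry for one block (split)."""
    path: str
    min_value: int
    max_value: int


def prune(blocks, lo, hi):
    """Keep only blocks whose [min, max] range overlaps the filter [lo, hi]."""
    return [b for b in blocks if b.max_value >= lo and b.min_value <= hi]


def driver_side_prune(blocks, lo, hi):
    # All index entries are held and pruned in one process (the driver).
    return prune(blocks, lo, hi)


def distributed_prune(blocks, lo, hi, num_executors=4):
    # The same prune() runs on each partition; threads stand in for executors.
    partitions = [blocks[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        results = pool.map(lambda part: prune(part, lo, hi), partitions)
    # Merge the per-partition results back into one list of surviving blocks.
    return [b for part in results for b in part]


# 100 hypothetical blocks covering the value ranges 0-9, 10-19, ..., 990-999.
blocks = [BlockIndex(f"part-{i}", i * 10, i * 10 + 9) for i in range(100)]
local = driver_side_prune(blocks, 250, 275)
remote = distributed_prune(blocks, 250, 275)
```

Both calls select the same blocks; only the place where prune() runs
changes, which is why this approach carries little risk of new errors.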

2.3 Starting and stopping the index server/client is the only new interface
that will be provided; all the existing interfaces will be reused. I'll
update the design document with this soon.

3. Yes, the index server will support multi-tenancy. We are currently trying
to figure out the best way to authenticate and authorise access for
multiple users.

4. Yes, a separate module will be created, but only to start the server and
client. The other logic will not be moved to this module.

Thanks
Kunal Kapoor

On Wed, Feb 13, 2019 at 6:59 AM xuchuanyin <[hidden email]> wrote:

> Hi Kunal,
> IndexServer is quite an efficient method to solve the problem of index
> cache and it's great that someone is finally trying to implement this.
> However, after going through your design document, I have some questions,
> which I'll explain as follows:
>
> 1. For the 'background' chapter, I think it is actually the type of pruning
> (distributed pruning or not) that matters, not the type of datamap (default
> or bloomfilter).
>
> 2. Extensibility of the IndexServer
> 2.1 In the design document, why did you finally choose 'one more spark
> cluster' as the IndexServer?
>
> 2.2 Have you considered other types of IndexServer, such as a DB, another
> in-memory storage engine, or even treating the current implementation as an
> embedded IndexServer? If yes, will the base IndexServer be extensible
> enough to support them in your design and implementation?
>
> 2.3 What are the interfaces that the IndexServer will expose to offer
> service? I also didn't get this info.
>
> 3. For the IndexServer, will multiple tenants also be OK?
>
> 4. During coding, will IndexServer be in a separate module?