Apache CarbonData Dev Mailing List archive

Re: Abstracting CarbonData's Index Interface

Posted by Venkata Gollamudi on Oct 02, 2016; 6:18pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Abstracting-CarbonData-s-Index-Interface-tp1587p1594.html

Yes Jacky, the interfaces need to be revisited.
For Goal 1 and Goal 2, abstraction is required for both the Index and the Index store.
Also, multi-column indexes (composite indexes) need to be considered.

Regards,
Ramana

On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li <[hidden email]> wrote:

> Hi community,
>
> Currently CarbonData has built-in index support, which is one of its key
> strengths. Using the index, CarbonData can run filter queries very fast by
> filtering at the block and blocklet level. However, the index tree also
> consumes memory, and the first query is slower because the index must be
> loaded from the file footer into memory. On the other hand, in a
> multi-tenant environment, multiple applications may access data files
> simultaneously, which further exacerbates this resource consumption issue.
> So, I want to propose and discuss a solution with you to solve this
> problem and to abstract the interface for CarbonData's future evolution.
> I am thinking the final result of this work should achieve at least two
> goals:
>
> Goal 1: Users can choose where to store Index data. It can be stored in
> the processing framework's memory space (such as Spark driver memory) or in
> another service outside of the processing framework (such as an
> independent database service).
>
> Goal 2: Developers can add more indices of their choice to CarbonData files.
> Besides the B+ tree on a multi-dimensional key, which CarbonData currently
> supports, developers are free to add other indexing technologies to make
> certain workloads faster. These new indices should be added in a pluggable way.
>
> In order to achieve these goals, an abstraction needs to be created for
> the CarbonData project, including:
>
> - Segment: each segment represents one load of data and is tied to the
> indices created with this load.
>
> - Index: an index is created when its segment is created, and is used
> when CarbonInputFormat's getSplits is called, to narrow the scan down to
> the required blocks or even blocklets.
>
> - CarbonInputFormat: There may be any number of indices created for a data
> file. When querying these data files, the InputFormat should know how to
> access these indices, and how to initialize or load them if required.
>
> Obviously, this work should be separated into different tasks and
> implemented gradually. But first of all, let's discuss the goals and the
> proposed approach. What are your thoughts?
>
> Regards,
> Jacky
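
For illustration only, the split between an Index and an Index store discussed in this thread could be sketched in Java roughly as below. Every name here (Index, IndexStore, InMemoryIndexStore, MinMaxIndex, planBlocks) is hypothetical and is not CarbonData's actual API, and the filter is simplified to a value range on a single long column. It is a minimal sketch of the idea, not a proposed implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexAbstractionSketch {

    /** Hypothetical pluggable index (Goal 2): prunes the blocks of one segment. */
    interface Index {
        String name();

        /** Returns ids of blocks that may hold values in [min, max]. */
        List<String> prune(long min, long max);
    }

    /** Hypothetical index store (Goal 1): decides where index data lives. */
    interface IndexStore {
        void put(String segmentId, Index index);

        List<Index> get(String segmentId);
    }

    /** Keeps indices in the local JVM heap; a remote service could implement the same interface. */
    static class InMemoryIndexStore implements IndexStore {
        private final Map<String, List<Index>> indices = new HashMap<>();

        public void put(String segmentId, Index index) {
            indices.computeIfAbsent(segmentId, k -> new ArrayList<>()).add(index);
        }

        public List<Index> get(String segmentId) {
            return indices.getOrDefault(segmentId, new ArrayList<>());
        }
    }

    /** Toy index (assumption, not CarbonData's B+ tree): one min/max range per block. */
    static class MinMaxIndex implements Index {
        private final Map<String, long[]> ranges = new LinkedHashMap<>();

        MinMaxIndex add(String blockId, long min, long max) {
            ranges.put(blockId, new long[] {min, max});
            return this;
        }

        public String name() {
            return "minmax";
        }

        public List<String> prune(long min, long max) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, long[]> e : ranges.entrySet()) {
                long[] r = e.getValue();
                // Keep the block if its range overlaps the query range.
                if (r[0] <= max && r[1] >= min) {
                    hits.add(e.getKey());
                }
            }
            return hits;
        }
    }

    /** What split planning might do: keep only blocks that every index of the segment accepts. */
    static List<String> planBlocks(IndexStore store, String segmentId,
                                   List<String> allBlocks, long min, long max) {
        List<String> candidates = new ArrayList<>(allBlocks);
        for (Index index : store.get(segmentId)) {
            candidates.retainAll(index.prune(min, max));
        }
        return candidates;
    }

    public static void main(String[] args) {
        IndexStore store = new InMemoryIndexStore();
        store.put("segment-0",
                new MinMaxIndex().add("block-0", 0, 99).add("block-1", 100, 199));
        List<String> blocks = Arrays.asList("block-0", "block-1");
        System.out.println(planBlocks(store, "segment-0", blocks, 150, 160)); // prints [block-1]
    }
}
```

Because planBlocks only sees the two interfaces, swapping the store for an external service or registering a second index type (for example a composite index) would not change the planning code in this sketch.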