Apache CarbonData Dev Mailing List archive

[Discussion] Abstracting CarbonData's Index Interface

Posted by Jacky Li on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Abstracting-CarbonData-s-Index-Interface-tp1587.html

Hi community,

Currently CarbonData has built-in index support, which is one of its key strengths. Using the index, CarbonData can run filter queries very fast by pruning blocks and blocklets. However, it also introduces some costs: the index tree consumes memory, and the first query is slowed by loading the index from the file footer into memory. In addition, users may want to deploy CarbonData in a multi-tenant environment, where multiple applications access the data files simultaneously, which further exacerbates this resource consumption issue.
So I want to propose and discuss with you a solution to this problem, and to abstract the index interface for CarbonData's future evolution.
I think the final result of this work should achieve at least two goals:

Goal 1: Users can choose where to store index data. It can be stored in the processing framework's memory space (such as Spark driver memory) or in a service outside the processing framework (such as an independent database service). These indices can then be shared across applications in a scalable way.

Goal 2: Developers can add more indices of their choice to CarbonData files. Besides the B+ tree on a multi-dimensional key that CarbonData currently supports, developers are free to add other indexing technologies to make certain workloads faster. These new indices should be added in a pluggable way.
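
To make the two goals a bit more concrete, here is a rough Java sketch, not a proposal for the final API. All names in it (IndexStore, InMemoryIndexStore, PluginRegistry) are hypothetical and do not exist in CarbonData today. It assumes index data can be handled as opaque bytes behind a storage interface (Goal 1), and that implementations are looked up by name rather than hard-wired (Goal 2).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// All names below are hypothetical; none are existing CarbonData APIs.

/** Goal 1: where serialized index data is kept. */
interface IndexStore {
    void put(String key, byte[] indexData);

    Optional<byte[]> get(String key);
}

/** Keeps index data in the processing framework's own memory (e.g. the Spark driver). */
class InMemoryIndexStore implements IndexStore {
    private final Map<String, byte[]> entries = new HashMap<>();

    @Override
    public void put(String key, byte[] indexData) {
        // Copy defensively so callers cannot mutate stored index data.
        entries.put(key, indexData.clone());
    }

    @Override
    public Optional<byte[]> get(String key) {
        byte[] value = entries.get(key);
        return value == null ? Optional.empty() : Optional.of(value.clone());
    }
}

/** Goal 2: implementations register under a name instead of being hard-wired. */
class PluginRegistry<T> {
    private final Map<String, Supplier<T>> factories = new HashMap<>();

    void register(String name, Supplier<T> factory) {
        if (factories.putIfAbsent(name, factory) != null) {
            throw new IllegalArgumentException("Already registered: " + name);
        }
    }

    T create(String name) {
        Supplier<T> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown plugin: " + name);
        }
        return factory.get();
    }
}
```

A store backed by an external service would implement the same IndexStore interface and be registered under a different name, so the choice of store could become a configuration value. The same registry shape could also hold index implementations.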

To achieve these goals, an abstraction needs to be created for the CarbonData project, including:

- Segment: each segment represents one load of data and is tied to the indices created with that load.

- Index: an index is created when its segment is created, and is used when CarbonInputFormat's getSplits is called, to select only the required blocks or even blocklets.

- CarbonInputFormat: there may be any number of indices created for a data file. When querying these data files, the InputFormat should know how to access these indices, and initialize or load them if required.
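
As a rough illustration of how these three pieces could fit together, here is a minimal Java sketch. Every type in it (RangeFilter, Index, MinMaxIndex, Segment, SplitPlanner) is hypothetical and only stands in for the abstraction described above. It assumes a filter on a single integer column and uses a simple per-block min/max index as the example of a pluggable index, which is not the B+ tree CarbonData currently uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical types for illustration; not existing CarbonData classes.

/** A filter on one integer column: lower <= value <= upper. */
final class RangeFilter {
    final int lower;
    final int upper;

    RangeFilter(int lower, int upper) {
        if (lower > upper) {
            throw new IllegalArgumentException("lower must not exceed upper");
        }
        this.lower = lower;
        this.upper = upper;
    }
}

/** An index returns the ids of the blocks that may contain matching rows. */
interface Index {
    Set<String> prune(RangeFilter filter);
}

/** Example pluggable index: the min and max column value of each block. */
final class MinMaxIndex implements Index {
    private final Map<String, int[]> ranges = new LinkedHashMap<>();

    void addBlock(String blockId, int min, int max) {
        ranges.put(blockId, new int[] {min, max});
    }

    @Override
    public Set<String> prune(RangeFilter filter) {
        Set<String> hits = new LinkedHashSet<>();
        for (Map.Entry<String, int[]> e : ranges.entrySet()) {
            int[] r = e.getValue();
            // Keep the block if its [min, max] range overlaps the filter range.
            if (r[1] >= filter.lower && r[0] <= filter.upper) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

/** One load of data, together with the indices created with that load. */
final class Segment {
    final String id;
    final List<String> blocks;
    final List<Index> indices;

    Segment(String id, List<String> blocks, List<Index> indices) {
        this.id = id;
        this.blocks = blocks;
        this.indices = indices;
    }

    /** Intersects the candidates of every index; with no index, all blocks remain. */
    Set<String> blocksToScan(RangeFilter filter) {
        Set<String> candidates = new LinkedHashSet<>(blocks);
        for (Index index : indices) {
            candidates.retainAll(index.prune(filter));
        }
        return candidates;
    }
}

/** Stand-in for the split planning step of an input format. */
final class SplitPlanner {
    static List<String> planSplits(List<Segment> segments, RangeFilter filter) {
        List<String> splits = new ArrayList<>();
        for (Segment segment : segments) {
            for (String block : segment.blocksToScan(filter)) {
                splits.add(segment.id + "/" + block);
            }
        }
        return splits;
    }
}
```

With this shape, the split planning code only sees the Index interface, so adding a new kind of index would not require changes to it. How indices are loaded or initialized on first use is left out of the sketch.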

Obviously, this work should be split into separate tasks and implemented gradually. But first of all, let's discuss the goals and the proposed approach. What do you think?

Regards,
Jacky