Apache CarbonData Dev Mailing List archive

Re: Abstracting CarbonData's Index Interface

Posted by Jacky Li on Oct 04, 2016; 3:52am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Abstracting-CarbonData-s-Index-Interface-tp1587p1611.html

> On Oct 4, 2016, at 5:43 AM, Qingqing Zhou <[hidden email]> wrote:
>
> On Fri, Sep 30, 2016 at 10:31 PM, Jacky Li <[hidden email]> wrote:
>> However, it also introduces memory consumption for the index tree and
>> impacts first-query time, because the index must be loaded from the
>> file footer into memory. On the other hand, in a multi-tenant
>> environment, multiple applications may access data files simultaneously,
>> which further exacerbates this resource consumption issue.
>>
> Agreed, we should at least not rely so heavily on driver memory for indexing.
>
>>
>> Goal 1: Users can choose where to store index data: either in the
>> processing framework's memory space (such as Spark driver memory) or
>> in a separate service outside the processing framework (such as an
>> independent database service).
>>
>
> How much code will be shared for the same index across different
> "places"? For example, for a B-tree index, if you build it inside Carbon,
> you are programming at the block level and have to worry about block
> [de]allocation, tree balancing, etc. But if you rely on a database
> service, you are programming at the table level, against relational
> tables and indexes. Meanwhile, an index is essentially redundant data, so
> updates need careful design if the index is outside of your control.
>

I think we can try to reuse everything except index storage, for example segment management and the query-processing logic that runs after the InputSplits are gathered by calling the index interface.
I think an index can be programmed at different levels. What I proposed here is still a block-level solution, so it can be handled at the InputFormat level. If you are looking for a table-level indexing solution, you would need to manipulate the query plan to do some kind of join between two tables. That means adding logic to the processing framework's optimizer, which I tend to avoid in the CarbonData project unless it has huge benefits. Because every optimizer has a different interface, there is no *standard* way to do this right now. Do you see any benefit in doing it at the table level?
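
To make the block-level idea a bit more concrete, here is a rough sketch in Python. It is illustration only and not CarbonData code: IndexStore, InMemoryIndexStore, BlockEntry and get_splits are hypothetical names, and simple min/max pruning stands in for whatever index structure is actually used. The point is that only the store differs per "place", while the split-pruning logic is shared.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BlockEntry:
    """Min/max statistics for one data block (hypothetical structure)."""
    path: str
    min_key: int
    max_key: int


class IndexStore(ABC):
    """Where index entries live; the only part that differs per 'place'."""

    @abstractmethod
    def put(self, segment: str, entries: List[BlockEntry]) -> None:
        ...

    @abstractmethod
    def get(self, segment: str) -> List[BlockEntry]:
        ...


class InMemoryIndexStore(IndexStore):
    """Stands in for index data held in the framework's own memory."""

    def __init__(self) -> None:
        self._segments: Dict[str, List[BlockEntry]] = {}

    def put(self, segment: str, entries: List[BlockEntry]) -> None:
        self._segments[segment] = list(entries)

    def get(self, segment: str) -> List[BlockEntry]:
        return list(self._segments.get(segment, []))


def get_splits(store: IndexStore, segments: List[str],
               low: int, high: int) -> List[str]:
    """Shared logic: keep blocks whose [min, max] range overlaps [low, high]."""
    splits = []
    for segment in segments:
        for block in store.get(segment):
            if block.min_key <= high and block.max_key >= low:
                splits.append(block.path)
    return splits


store = InMemoryIndexStore()
store.put("segment_0", [BlockEntry("part-0", 0, 99),
                        BlockEntry("part-1", 100, 199)])
store.put("segment_1", [BlockEntry("part-2", 150, 300)])

# Only blocks that may contain keys in [120, 160] survive pruning.
pruned = get_splits(store, ["segment_0", "segment_1"], 120, 160)
print(pruned)
```

A store backed by an external database service would implement the same two methods, and get_splits would not need to change. No query plan or optimizer is involved, which is why this kind of pruning can sit at the InputFormat level.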

> Regards,
> Qingqing
>