> On second thought regarding the index part, another option is to have a
> very simple Segment definition that can only list all the files it has, or
> offer a listFile method taking the QueryModel as input. Implementations of
> Segment can be IndexSegment, MultiIndexSegment, or StreamingSegment (no
> index). In the future, developers are free to create a MultiIndexSegment
> that selects an index internally. Is this option better?
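
The simpler option described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the `listFiles` signatures, the stand-in `QueryModel` (reduced to a file-name predicate), and the way `IndexSegment` simulates an index are all hypothetical and are not existing CarbonData APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for QueryModel: it carries only a predicate over file names.
final class QueryModel {
    final Predicate<String> fileFilter;
    QueryModel(Predicate<String> fileFilter) { this.fileFilter = fileFilter; }
}

// The simple Segment contract: list every file, or list the files relevant to a query.
interface Segment {
    List<String> listFiles();
    List<String> listFiles(QueryModel query);
}

// No index: every file is returned regardless of the query.
final class StreamingSegment implements Segment {
    private final List<String> files;
    StreamingSegment(List<String> files) { this.files = new ArrayList<>(files); }
    public List<String> listFiles() { return new ArrayList<>(files); }
    public List<String> listFiles(QueryModel query) { return listFiles(); }
}

// One index: the "index" is simulated here by applying the query's filter to each file.
final class IndexSegment implements Segment {
    private final List<String> files;
    IndexSegment(List<String> files) { this.files = new ArrayList<>(files); }
    public List<String> listFiles() { return new ArrayList<>(files); }
    public List<String> listFiles(QueryModel query) {
        List<String> matched = new ArrayList<>();
        for (String file : files) {
            if (query.fileFilter.test(file)) {
                matched.add(file);
            }
        }
        return matched;
    }
}
```

With this shape, a MultiIndexSegment would be one more implementation of `Segment` that picks an index inside `listFiles(QueryModel)`, so callers never see the selection logic.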
>
> Regards,
> Jacky
>
> > On Oct 3, 2016, at 11:00 AM, Jacky Li <[hidden email]> wrote:
> >
> > I am currently thinking of these abstractions:
> >
> > - A SegmentManager is the global manager of all segments for one table.
> >   It can be used to get all segments and to manage segments during loading
> >   and compaction.
> > - A CarbonInputFormat takes the table path as input, which means it
> >   represents the whole table, containing all segments. When getSplit is
> >   called, it gets all segments by calling the SegmentManager interface.
> > - Each Segment contains a list of Index and an IndexSelector. While
> >   CarbonData currently has only the MDK index, developers can create
> >   multiple indices for each segment in the future.
> > - An Index is an interface for filtering on block/blocklet, and provides
> >   this functionality only. Implementations should hide all complexity, such
> >   as deciding where to store the index.
> > - An IndexSelector is an interface to choose which index to use based on
> >   query predicates. The default implementation chooses the first index. An
> >   implementation of IndexSelector can also decide not to use an index at
> >   all.
> > - A Distributor is used to map the filtered block/blocklet to InputSplits.
> >   Implementations can take the number of nodes and the parallelism into
> >   consideration. They can also decide to distribute tasks based on block
> >   or blocklet.
> >
> > So the main concepts are SegmentManager, Segment, Index, IndexSelector,
> > InputFormat/OutputFormat, and Distributor.
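
As one illustration of the IndexSelector concept listed above, here is a minimal Java sketch of a default selector that picks the first index and an alternative that opts out of index use entirely. The interface shapes, the use of `Optional` to express "no index", the predicate passed as plain text, and the `PassThroughIndex` helper are all assumptions made for this example, not CarbonData code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical Index: filters a list of block ids for a predicate given as text.
interface Index {
    String name();
    List<String> filter(List<String> blocks, String predicate);
}

// Hypothetical IndexSelector: an empty result means "do not use any index".
interface IndexSelector {
    Optional<Index> select(List<Index> indices, String predicate);
}

// Default behaviour described in the thread: choose the first index, if there is one.
final class FirstIndexSelector implements IndexSelector {
    public Optional<Index> select(List<Index> indices, String predicate) {
        if (indices.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(indices.get(0));
    }
}

// A selector that always decides not to use an index.
final class NoIndexSelector implements IndexSelector {
    public Optional<Index> select(List<Index> indices, String predicate) {
        return Optional.empty();
    }
}

// Illustrative helper only: an index that keeps every block.
final class PassThroughIndex implements Index {
    private final String name;
    PassThroughIndex(String name) { this.name = name; }
    public String name() { return name; }
    public List<String> filter(List<String> blocks, String predicate) {
        return new ArrayList<>(blocks);
    }
}
```

A selector that inspects the predicate (for example, to match columns covered by a composite index) would implement the same interface; the caller only needs to handle the empty case by scanning all blocks.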
> >
> > There will be a default implementation of CarbonInputFormat whose
> > getSplit will do the following:
> > 1. get all segments by calling SegmentManager
> > 2. for each segment, choose the index to use via IndexSelector
> > 3. invoke the selected Index to filter out block/blocklet (since these
> >    are two concepts, maybe a parent class needs to be created to
> >    encapsulate them)
> > 4. distribute the filtered block/blocklet to InputSplits via Distributor.
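
The four steps above can be sketched in Java as a single planning method. Everything here is a hypothetical stand-in: blocks are plain strings, the predicate is a `Predicate<String>`, a split is a list of blocks, and the fallback of keeping every block when no index is selected is an assumption the thread does not spell out. The real CarbonInputFormat and Hadoop InputSplit types are deliberately not used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// All types below are hypothetical stand-ins, not CarbonData classes.
interface Index { List<String> filter(List<String> blocks, Predicate<String> predicate); }
interface IndexSelector { Optional<Index> select(List<Index> indices); }
interface Segment { List<String> blocks(); List<Index> indices(); }
interface SegmentManager { List<Segment> getAllSegments(); }
interface Distributor { List<List<String>> distribute(List<String> blocks); }

final class SplitPlanner {
    static List<List<String>> getSplits(SegmentManager manager, IndexSelector selector,
                                        Distributor distributor, Predicate<String> predicate) {
        List<String> selected = new ArrayList<>();
        for (Segment segment : manager.getAllSegments()) {               // step 1
            Optional<Index> index = selector.select(segment.indices()); // step 2
            if (index.isPresent()) {                                     // step 3
                selected.addAll(index.get().filter(segment.blocks(), predicate));
            } else {
                selected.addAll(segment.blocks()); // assumed fallback: no index, keep all blocks
            }
        }
        return distributor.distribute(selected);                         // step 4
    }
}

// Illustrative helpers used to exercise the planner.
final class ScanIndex implements Index {
    public List<String> filter(List<String> blocks, Predicate<String> predicate) {
        List<String> kept = new ArrayList<>();
        for (String block : blocks) {
            if (predicate.test(block)) {
                kept.add(block);
            }
        }
        return kept;
    }
}

final class SimpleSegment implements Segment {
    private final List<String> blocks;
    private final List<Index> indices;
    SimpleSegment(List<String> blocks, List<Index> indices) {
        this.blocks = blocks;
        this.indices = indices;
    }
    public List<String> blocks() { return blocks; }
    public List<Index> indices() { return indices; }
}

final class FirstIndexSelector implements IndexSelector {
    public Optional<Index> select(List<Index> indices) {
        if (indices.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(indices.get(0));
    }
}

// One block per split; a real Distributor could group by node or parallelism instead.
final class PerBlockDistributor implements Distributor {
    public List<List<String>> distribute(List<String> blocks) {
        List<List<String>> splits = new ArrayList<>();
        for (String block : blocks) {
            splits.add(Collections.singletonList(block));
        }
        return splits;
    }
}
```

Keeping the Distributor as the last, separate step means the block-versus-blocklet granularity decision stays out of the index code.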
> >
> > Regarding the input to the Index.filter interface, I have not decided
> > whether to use the existing QueryModel or create a new, cleaner QueryModel
> > interface. If a new QueryModel is desired, it should contain only the
> > filter predicate and the projected columns, so it would be much simpler
> > than the current QueryModel. But I see the current QueryModel is also used
> > in compaction, so I think it is better to do this cleanup later?
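
For a sense of scale, a slimmed-down query description of the kind mentioned above might look like the following Java sketch. The class name `SimpleQueryModel` and the choice to hold the predicate as text are assumptions made for illustration; the existing QueryModel is not shown or modified here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical minimal query description: one filter predicate plus the projected columns.
final class SimpleQueryModel {
    private final String filterPredicate;          // kept as text for this sketch
    private final List<String> projectionColumns;  // immutable copy of the requested columns

    SimpleQueryModel(String filterPredicate, List<String> projectionColumns) {
        this.filterPredicate = filterPredicate;
        this.projectionColumns =
            Collections.unmodifiableList(new ArrayList<>(projectionColumns));
    }

    String filterPredicate() { return filterPredicate; }
    List<String> projectionColumns() { return projectionColumns; }
}
```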
> >
> >
> > Does this look fine to you? Any suggestion is welcome.
> >
> > Regards,
> > Jacky
> >
> >
> >> On Oct 3, 2016, at 2:18 AM, Venkata Gollamudi <[hidden email]> wrote:
> >>
> >> Yes Jacky, the interfaces need to be revisited.
> >> For Goal 1 and Goal 2: an abstraction is required for both the Index and
> >> the Index store.
> >> Also, multi-column indexes (composite indexes) need to be considered.
> >>
> >> Regards,
> >> Ramana
> >>
> >> On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li <[hidden email]> wrote:
> >>
> >>> Hi community,
> >>>
> >>> Currently CarbonData has built-in index support, which is one of the key
> >>> strengths of CarbonData. Using the index, CarbonData can run very fast
> >>> filter queries by filtering at the block and blocklet level. However, it
> >>> also introduces memory consumption for the index tree and increases the
> >>> first query time, because the index has to be loaded from the file footer
> >>> into memory. On the other hand, in a multi-tenant environment, multiple
> >>> applications may access data files simultaneously, which further
> >>> exacerbates this resource consumption issue.
> >>> So, I want to propose and discuss a solution with you to solve this
> >>> problem and to abstract the interfaces for CarbonData's future
> >>> evolution.
> >>> I am thinking the final result of this work should achieve at least two
> >>> goals:
> >>>
> >>> Goal 1: Users can choose where to store index data. It can be stored in
> >>> the processing framework's memory space (such as Spark driver memory) or
> >>> in another service outside the processing framework (such as an
> >>> independent database service).
> >>>
> >>> Goal 2: Developers can add more indices of their choice to CarbonData
> >>> files. Besides the B+ tree on the multi-dimensional key, which CarbonData
> >>> currently supports, developers are free to add other indexing technologies
> >>> to make certain workloads faster. These new indices should be added in a
> >>> pluggable way.
> >>>
> >>> In order to achieve these goals, an abstraction needs to be created for
> >>> the CarbonData project, including:
> >>>
> >>> - Segment: each segment represents one load of data and is tied to the
> >>> indices created with this load.
> >>>
> >>> - Index: an index is created when its segment is created, and is leveraged
> >>> when CarbonInputFormat's getSplit is called, to filter out the required
> >>> blocks or even blocklets.
> >>>
> >>> - CarbonInputFormat: there may be any number of indices created for a data
> >>> file. When querying these data files, the InputFormat should know how to
> >>> access these indices and initialize or load them if required.
> >>>
> >>> Obviously, this work should be separated into different tasks and
> >>> implemented gradually. But first of all, let's discuss the goals and the
> >>> proposed approach. What are your thoughts?
> >>>
> >>> Regards,
> >>> Jacky
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587.html
> >>> Sent from the Apache CarbonData Mailing List archive mailing list archive
> >>> at Nabble.com.
> >>>
> >
>
>