http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Abstracting-CarbonData-s-Index-Interface-tp1587p1600.html
segment info. Only required and valid data chunk will be read during
scanning.
> Agreed. Shall I create a JIRA issue and PR for this abstraction?
> I think reviewing the interface code will be clearer.
>
> Regards,
> Jacky
>
> > On Oct 3, 2016, at 2:38 PM, Aniket Adnaik [via Apache CarbonData Mailing List
> > archive] <[hidden email]> wrote:
> >
> > I would agree with having a simple segment definition. A Segment can use
> > metadata that describes it, for example: segment type, index
> > availability, index type, index storage type (attached or
> > detached/secondary), etc. For a streaming ingest segment, it may also
> > contain min-max information for each blocklet, which can be used for
> > indexing.
> > So the implementation details of different segment types can be hidden
> > from the user.
> > We may have to think about partitioning support along with load segments
> > in the future.
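
A minimal Java sketch of the segment metadata described above, including per-blocklet min-max values used for pruning. Every name here (SegmentMetadata, BlockletMinMax, the enum constants) is hypothetical and chosen for illustration only; none of them is taken from the CarbonData code base, and a single long-valued column is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// All names below are hypothetical; they are not CarbonData classes.
enum SegmentType { BATCH_LOAD, STREAMING_INGEST }

enum IndexStorage { ATTACHED, DETACHED }

/** Min and max of one column inside one blocklet. */
final class BlockletMinMax {
    final int blockletId;
    final long min;
    final long max;

    BlockletMinMax(int blockletId, long min, long max) {
        this.blockletId = blockletId;
        this.min = min;
        this.max = max;
    }

    /** True if some value in [lo, hi] could be stored in this blocklet. */
    boolean mayContain(long lo, long hi) {
        return max >= lo && min <= hi;
    }
}

/** Descriptive metadata a segment could expose to the reader. */
final class SegmentMetadata {
    final SegmentType type;
    final boolean indexAvailable;
    final String indexType;          // for example "MDK"; null when there is no index
    final IndexStorage indexStorage; // null when there is no index
    final List<BlockletMinMax> minMax;

    SegmentMetadata(SegmentType type, boolean indexAvailable, String indexType,
                    IndexStorage indexStorage, List<BlockletMinMax> minMax) {
        this.type = type;
        this.indexAvailable = indexAvailable;
        this.indexType = indexType;
        this.indexStorage = indexStorage;
        this.minMax = new ArrayList<>(minMax);
    }

    /** Ids of blocklets whose [min, max] range overlaps the queried range. */
    List<Integer> candidateBlocklets(long lo, long hi) {
        List<Integer> ids = new ArrayList<>();
        for (BlockletMinMax b : minMax) {
            if (b.mayContain(lo, hi)) {
                ids.add(b.blockletId);
            }
        }
        return ids;
    }
}

final class SegmentMetadataDemo {
    /** A streaming segment without an index but with min-max per blocklet. */
    static SegmentMetadata streamingSample() {
        return new SegmentMetadata(SegmentType.STREAMING_INGEST, false, null, null,
            Arrays.asList(
                new BlockletMinMax(0, 0, 9),
                new BlockletMinMax(1, 10, 19),
                new BlockletMinMax(2, 20, 29)));
    }

    public static void main(String[] args) {
        // Only blocklets 1 and 2 overlap the range [15, 25].
        System.out.println(streamingSample().candidateBlocklets(15, 25)); // prints [1, 2]
    }
}
```

The point of the sketch is that the reader only asks the metadata which blocklets to scan, so the segment type stays hidden from the caller.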
> >
> > Best Regards,
> > Aniket
> >
> >
> >
> > On Sun, Oct 2, 2016 at 10:25 PM, Jacky Li <[hidden email]> wrote:
> >
> > > > After a second thought regarding the index part, another option is to
> > > > have a very simple Segment definition which can only list all the
> > > > files it has, or a listFile method taking the QueryModel as input. An
> > > > implementation of Segment can be IndexSegment, MultiIndexSegment or
> > > > StreamingSegment (no index). In the future, developers are free to
> > > > create a MultiIndexSegment that selects the index internally. Is this
> > > > option better?
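
The simple Segment option could look roughly like the Java sketch below. It is an illustration under assumptions: EqualsQuery is a hypothetical stand-in for QueryModel (a single equality predicate on one long column), the method is spelled listFiles, and MultiIndexSegment is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for QueryModel: one equality predicate on a long column. */
final class EqualsQuery {
    final long value;

    EqualsQuery(long value) {
        this.value = value;
    }
}

/** The only thing a caller can ask a segment: which files must be read? */
interface Segment {
    List<String> listFiles(EqualsQuery query);
}

/** No index: every file is returned regardless of the query. */
final class StreamingSegment implements Segment {
    private final List<String> files;

    StreamingSegment(List<String> files) {
        this.files = new ArrayList<>(files);
    }

    @Override
    public List<String> listFiles(EqualsQuery query) {
        return new ArrayList<>(files);
    }
}

/** Uses a per-file [min, max] range as a toy index to prune files. */
final class IndexSegment implements Segment {
    private final Map<String, long[]> rangeByFile = new LinkedHashMap<>();

    IndexSegment add(String file, long min, long max) {
        rangeByFile.put(file, new long[] {min, max});
        return this;
    }

    @Override
    public List<String> listFiles(EqualsQuery query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, long[]> entry : rangeByFile.entrySet()) {
            long[] range = entry.getValue();
            if (range[0] <= query.value && query.value <= range[1]) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}

final class SegmentDemo {
    public static void main(String[] args) {
        Segment indexed = new IndexSegment().add("part-0", 0, 9).add("part-1", 10, 19);
        Segment streaming = new StreamingSegment(Arrays.asList("stream-0", "stream-1"));
        EqualsQuery query = new EqualsQuery(12);
        System.out.println(indexed.listFiles(query));   // prints [part-1]
        System.out.println(streaming.listFiles(query)); // prints [stream-0, stream-1]
    }
}
```

With this shape the index choice stays inside each Segment implementation, which is the trade-off the message asks about.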
> > >
> > > Regards,
> > > Jacky
> > >
> > > > > On Oct 3, 2016, at 11:00 AM, Jacky Li <[hidden email]> wrote:
> > > >
> > > > > I am currently thinking of these abstractions:
> > > > >
> > > > > - A SegmentManager is the global manager of all segments for one
> > > > >   table. It can be used to get all segments and to manage segments
> > > > >   during loading and compaction.
> > > > > - A CarbonInputFormat takes the table path as input, which means it
> > > > >   represents the whole table containing all segments. When getSplit
> > > > >   is called, it gets all segments by calling the SegmentManager
> > > > >   interface.
> > > > > - Each Segment contains a list of Index and an IndexSelector. While
> > > > >   CarbonData currently has only the MDK index, developers can create
> > > > >   multiple indices for each segment in the future.
> > > > > - An Index is an interface for filtering on block/blocklet, and
> > > > >   provides this functionality only. An implementation should hide
> > > > >   all complexity, such as deciding where to store the index.
> > > > > - An IndexSelector is an interface to choose which index to use
> > > > >   based on query predicates. The default implementation chooses the
> > > > >   first index. An implementation of IndexSelector can also decide
> > > > >   not to use an index at all.
> > > > > - A Distributor is used to map the filtered block/blocklet to
> > > > >   InputSplits. An implementation can take the number of nodes and
> > > > >   the parallelism into consideration. It can also decide to
> > > > >   distribute tasks based on block or blocklet.
> > > > >
> > > > > So the main concepts are SegmentManager, Segment, Index,
> > > > > IndexSelector, InputFormat/OutputFormat, Distributor.
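
To make the IndexSelector idea concrete, here is a small Java sketch of a default "first index" selector next to a selector that may decide to use no index. The interface shapes, the covers method and the class names are assumptions made for this example, not an agreed design.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical interfaces; only the concept names come from the discussion.
interface Index {
    String name();

    /** True if this index can help with a predicate on the given column. */
    boolean covers(String column);
}

interface IndexSelector {
    /** An empty result means "do not use any index". */
    Optional<Index> select(List<Index> indices, Set<String> predicateColumns);
}

/** Default behaviour described in the thread: always take the first index. */
final class FirstIndexSelector implements IndexSelector {
    @Override
    public Optional<Index> select(List<Index> indices, Set<String> predicateColumns) {
        return indices.stream().findFirst();
    }
}

/** Picks the first index covering a predicate column, otherwise no index. */
final class ColumnAwareSelector implements IndexSelector {
    @Override
    public Optional<Index> select(List<Index> indices, Set<String> predicateColumns) {
        return indices.stream()
            .filter(index -> predicateColumns.stream().anyMatch(index::covers))
            .findFirst();
    }
}

final class ColumnIndex implements Index {
    private final String name;
    private final Set<String> columns;

    ColumnIndex(String name, String... columns) {
        this.name = name;
        this.columns = new HashSet<>(Arrays.asList(columns));
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public boolean covers(String column) {
        return columns.contains(column);
    }
}

final class SelectorDemo {
    static List<Index> sampleIndices() {
        return Arrays.<Index>asList(
            new ColumnIndex("mdk", "country", "city"),
            new ColumnIndex("text", "comment"));
    }

    static String chosen(IndexSelector selector, String predicateColumn) {
        return selector.select(sampleIndices(), Collections.singleton(predicateColumn))
            .map(Index::name)
            .orElse("none");
    }

    public static void main(String[] args) {
        System.out.println(chosen(new FirstIndexSelector(), "comment"));  // prints mdk
        System.out.println(chosen(new ColumnAwareSelector(), "comment")); // prints text
        System.out.println(chosen(new ColumnAwareSelector(), "price"));   // prints none
    }
}
```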
> > > >
> > > > > There will be a default implementation of CarbonInputFormat whose
> > > > > getSplit will do the following:
> > > > > 1. get all segments by calling SegmentManager
> > > > > 2. for each segment, choose the index to use by IndexSelector
> > > > > 3. invoke the selected Index to filter out block/blocklet (since
> > > > >    these are two concepts, maybe a parent class needs to be created
> > > > >    to encapsulate them)
> > > > > 4. distribute the filtered block/blocklet to InputSplits by
> > > > >    Distributor.
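
The four getSplit steps above can be sketched in Java as follows. This is only an illustration under assumptions: it does not use the Hadoop InputFormat API, ScanUnit is a hypothetical stand-in for the suggested parent type of block and blocklet, the filter is a range predicate on one long column, and the round-robin Distributor is one arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in for the suggested parent type that covers both block and blocklet. */
final class ScanUnit {
    final String id;
    final long min;
    final long max;

    ScanUnit(String id, long min, long max) {
        this.id = id;
        this.min = min;
        this.max = max;
    }
}

// Hypothetical interface shapes; only the concept names come from the thread.
interface Index { List<ScanUnit> filter(long lo, long hi); }
interface IndexSelector { Index select(List<Index> indices); }
interface Distributor { List<List<ScanUnit>> distribute(List<ScanUnit> units, int parallelism); }
interface SegmentManager { List<Segment> getAllSegments(); }

final class Segment {
    final List<Index> indices;
    final IndexSelector selector;

    Segment(List<Index> indices, IndexSelector selector) {
        this.indices = indices;
        this.selector = selector;
    }
}

/** Toy index: keeps the units whose [min, max] overlaps the queried range. */
final class MinMaxIndex implements Index {
    private final List<ScanUnit> units;

    MinMaxIndex(ScanUnit... units) {
        this.units = Arrays.asList(units);
    }

    @Override
    public List<ScanUnit> filter(long lo, long hi) {
        List<ScanUnit> kept = new ArrayList<>();
        for (ScanUnit unit : units) {
            if (unit.max >= lo && unit.min <= hi) {
                kept.add(unit);
            }
        }
        return kept;
    }
}

/** Assigns units to splits in round-robin order. */
final class RoundRobinDistributor implements Distributor {
    @Override
    public List<List<ScanUnit>> distribute(List<ScanUnit> units, int parallelism) {
        List<List<ScanUnit>> splits = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            splits.add(new ArrayList<ScanUnit>());
        }
        for (int i = 0; i < units.size(); i++) {
            splits.get(i % parallelism).add(units.get(i));
        }
        return splits;
    }
}

final class SketchInputFormat {
    static List<List<ScanUnit>> getSplits(SegmentManager manager, Distributor distributor,
                                          long lo, long hi, int parallelism) {
        List<ScanUnit> selected = new ArrayList<>();
        for (Segment segment : manager.getAllSegments()) {          // step 1
            Index index = segment.selector.select(segment.indices); // step 2
            selected.addAll(index.filter(lo, hi));                  // step 3
        }
        return distributor.distribute(selected, parallelism);       // step 4
    }

    static List<List<ScanUnit>> demo() {
        IndexSelector first = indices -> indices.get(0);
        Segment a = new Segment(Arrays.<Index>asList(
            new MinMaxIndex(new ScanUnit("a0", 0, 9), new ScanUnit("a1", 10, 19))), first);
        Segment b = new Segment(Arrays.<Index>asList(
            new MinMaxIndex(new ScanUnit("b0", 5, 14), new ScanUnit("b1", 30, 39))), first);
        SegmentManager manager = () -> Arrays.asList(a, b);
        return getSplits(manager, new RoundRobinDistributor(), 8, 12, 2);
    }

    public static void main(String[] args) {
        for (List<ScanUnit> split : demo()) {
            StringBuilder ids = new StringBuilder();
            for (ScanUnit unit : split) {
                ids.append(unit.id).append(' ');
            }
            System.out.println(ids.toString().trim());
        }
    }
}
```

In the demo, units a0, a1 and b0 overlap the range [8, 12] while b1 is pruned, so two splits are produced with the surviving units spread across them.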
> > > >
> > > > > Regarding the input to the Index.filter interface, I have not
> > > > > decided whether to use the existing QueryModel or create a new,
> > > > > cleaner QueryModel interface. If a new QueryModel is desired, it
> > > > > should only contain the filter predicate and the projected columns,
> > > > > so it would be much simpler than the current QueryModel. But I see
> > > > > the current QueryModel is also used in Compaction, so I think it is
> > > > > better to do this cleanup later?
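
As a rough picture of how small such a QueryModel could be, the Java sketch below holds only a filter predicate and a projection list. SlimQueryModel is a hypothetical name, and the row-level Predicate is a simplification: an index would more likely need a structured predicate expression that it can inspect rather than an opaque function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical slimmed-down query model: a filter plus projected columns. */
final class SlimQueryModel {
    private final Predicate<Map<String, Object>> filter;
    private final List<String> projection;

    SlimQueryModel(Predicate<Map<String, Object>> filter, List<String> projection) {
        this.filter = filter;
        this.projection = Collections.unmodifiableList(new ArrayList<>(projection));
    }

    /** Applies the filter, then keeps only the projected columns, in order. */
    List<List<Object>> execute(List<Map<String, Object>> rows) {
        List<List<Object>> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (!filter.test(row)) {
                continue;
            }
            List<Object> projected = new ArrayList<>();
            for (String column : projection) {
                projected.add(row.get(column));
            }
            result.add(projected);
        }
        return result;
    }
}

final class SlimQueryModelDemo {
    static Map<String, Object> row(int id, String city, int amount) {
        Map<String, Object> values = new HashMap<>();
        values.put("id", id);
        values.put("city", city);
        values.put("amount", amount);
        return values;
    }

    static List<List<Object>> demo() {
        SlimQueryModel query = new SlimQueryModel(
            r -> ((Integer) r.get("amount")) > 100,
            Arrays.asList("id", "city"));
        return query.execute(Arrays.asList(row(1, "Pune", 50), row(2, "Delhi", 150)));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[2, Delhi]]
    }
}
```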
> > > >
> > > >
> > > > Does this look fine to you? Any suggestion is welcome.
> > > >
> > > > Regards,
> > > > Jacky
> > > >
> > > >
> > > > >> On Oct 3, 2016, at 2:18 AM, Venkata Gollamudi <[hidden email]> wrote:
> > > >>
> > > > >> Yes Jacky, the interfaces need to be revisited.
> > > > >> For Goal 1 and Goal 2: abstraction is required for both Index and
> > > > >> Index store.
> > > > >> Also, multi-column indexes (composite indexes) need to be considered.
> > > >>
> > > >> Regards,
> > > >> Ramana
> > > >>
> > > > >> On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li <[hidden email]> wrote:
> > > >>
> > > >>> Hi community,
> > > >>>
> > > > >>> Currently CarbonData has built-in index support, which is one of
> > > > >>> its key strengths. Using the index, CarbonData can do very fast
> > > > >>> filter queries by filtering at the block and blocklet level.
> > > > >>> However, it also introduces memory consumption for the index tree
> > > > >>> and impacts first query time because of the process of loading the
> > > > >>> index from the file footer into memory. On the other side, in a
> > > > >>> multi-tenant environment, multiple applications may access data
> > > > >>> files simultaneously, which again exacerbates this resource
> > > > >>> consumption issue.
> > > > >>> So, I want to propose and discuss a solution with you to solve this
> > > > >>> problem and make an abstraction of the interface for CarbonData's
> > > > >>> future evolution.
> > > > >>> I am thinking the final result of this work should achieve at least
> > > > >>> two goals:
> > > >>>
> > > > >>> Goal 1: Users can choose where to store Index data. It can be
> > > > >>> stored in the processing framework's memory space (like Spark
> > > > >>> driver memory) or in another service outside of the processing
> > > > >>> framework (like an independent database service).
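
One way to picture Goal 1 is a small store interface that hides where the index bytes live. The Java sketch below is an assumption-laden illustration: IndexStore, InMemoryIndexStore and IndexLoader are hypothetical names, the footer read is faked, and an external-service implementation is not shown but would implement the same interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical abstraction over where serialized index data is kept. */
interface IndexStore {
    Optional<byte[]> get(String segmentId);

    void put(String segmentId, byte[] serializedIndex);
}

/** Keeps index data in the memory of the current process. */
final class InMemoryIndexStore implements IndexStore {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    @Override
    public Optional<byte[]> get(String segmentId) {
        return Optional.ofNullable(entries.get(segmentId));
    }

    @Override
    public void put(String segmentId, byte[] serializedIndex) {
        entries.put(segmentId, serializedIndex);
    }
}

/** Loads an index through whichever store was configured. */
final class IndexLoader {
    private final IndexStore store;
    private int footerReads;

    IndexLoader(IndexStore store) {
        this.store = store;
    }

    byte[] load(String segmentId) {
        Optional<byte[]> cached = store.get(segmentId);
        if (cached.isPresent()) {
            return cached.get();
        }
        byte[] built = readIndexFromFooter(segmentId);
        store.put(segmentId, built);
        return built;
    }

    /** Fake footer read; a real one would parse the data file footer. */
    private byte[] readIndexFromFooter(String segmentId) {
        footerReads++;
        return ("index-of-" + segmentId).getBytes(StandardCharsets.UTF_8);
    }

    int footerReads() {
        return footerReads;
    }
}

final class IndexStoreDemo {
    public static void main(String[] args) {
        IndexStore shared = new InMemoryIndexStore();
        IndexLoader first = new IndexLoader(shared);
        IndexLoader second = new IndexLoader(shared);
        first.load("segment-0");
        first.load("segment-0");
        second.load("segment-0");
        // The footer is read once; later loads are served from the store.
        System.out.println(first.footerReads() + " " + second.footerReads()); // prints 1 0
    }
}
```

Because the loader depends only on the interface, swapping the in-memory store for a shared external one would not change the calling code, which is where the multi-tenant benefit would come from.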
> > > >>>
> > > > >>> Goal 2: Developers can add more indices of their choice to
> > > > >>> CarbonData files. Besides the B+ tree on the multi-dimensional key
> > > > >>> which CarbonData currently supports, developers are free to add
> > > > >>> other indexing technologies to make certain workloads faster. These
> > > > >>> new indices should be added in a pluggable way.
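
"Pluggable" could be realized in several ways; the Java sketch below shows one simple possibility, a registry of index factories keyed by a type name. IndexRegistry, PluggableIndex and the type names used in the demo are hypothetical, and other mechanisms (for example Java's ServiceLoader) would serve equally well.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical minimal contract for an index implementation. */
interface PluggableIndex {
    String type();
}

final class NamedIndex implements PluggableIndex {
    private final String type;

    NamedIndex(String type) {
        this.type = type;
    }

    @Override
    public String type() {
        return type;
    }
}

/** Maps an index type name to a factory that creates the index. */
final class IndexRegistry {
    private final Map<String, Supplier<PluggableIndex>> factories = new LinkedHashMap<>();

    void register(String type, Supplier<PluggableIndex> factory) {
        if (factories.putIfAbsent(type, factory) != null) {
            throw new IllegalArgumentException("Index type already registered: " + type);
        }
    }

    PluggableIndex create(String type) {
        Supplier<PluggableIndex> factory = factories.get(type);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown index type: " + type);
        }
        return factory.get();
    }

    Set<String> types() {
        return Collections.unmodifiableSet(factories.keySet());
    }
}

final class IndexRegistryDemo {
    static IndexRegistry sample() {
        IndexRegistry registry = new IndexRegistry();
        registry.register("btree-mdk", () -> new NamedIndex("btree-mdk")); // built-in
        registry.register("inverted", () -> new NamedIndex("inverted"));   // added by a developer
        return registry;
    }

    public static void main(String[] args) {
        IndexRegistry registry = sample();
        System.out.println(registry.types());                   // prints [btree-mdk, inverted]
        System.out.println(registry.create("inverted").type()); // prints inverted
    }
}
```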
> > > >>>
> > > > >>> In order to achieve these goals, an abstraction needs to be created
> > > > >>> for the CarbonData project, including:
> > > > >>>
> > > > >>> - Segment: each segment represents one load of data, and is tied to
> > > > >>> the indices created with this load
> > > > >>>
> > > > >>> - Index: an index is created when its segment is created, and is
> > > > >>> leveraged when CarbonInputFormat's getSplit is called, to filter
> > > > >>> out the required blocks or even blocklets.
> > > > >>>
> > > > >>> - CarbonInputFormat: There may be any number of indices created for
> > > > >>> a data file. When querying these data files, InputFormat should
> > > > >>> know how to access these indices, and initialize or load them if
> > > > >>> required.
> > > >>>
> > > > >>> Obviously, this work should be separated into different tasks and
> > > > >>> implemented gradually. But first of all, let's discuss the goal and
> > > > >>> the proposed approach. What is your idea?
> > > >>>
> > > >>> Regards,
> > > >>> Jacky
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > > >>> View this message in context:
> > > > >>> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587.html
> > > > >>> Sent from the Apache CarbonData Mailing List archive mailing list
> > > > >>> archive at Nabble.com.
> > > >>>
> > > >
> > >
> > >
> >
> >
>
>
>
>
> --
> View this message in context:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1599.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>