Apache CarbonData Dev Mailing List archive

Re: Abstracting CarbonData's Index Interface

Posted by kumarvishal09 on Oct 03, 2016; 10:36am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Abstracting-CarbonData-s-Index-Interface-tp1587p1600.html

Hi Jacky,
I am also changing the carbondata file thrift structure to
read less and only required data while loading the btree, Main changes will
be removing the data chunk from blocklet info and keeping only the offset
of the data chunk and some of the redundant information which is already
present in the carbonindex file i am removing from carbon data file like
segment info. Only required and valid data chunk will be read during
scanning.

-Regards
Kumar Vishal

On Mon, Oct 3, 2016 at 1:08 PM, Jacky Li <[hidden email]> wrote:

> Agreed. Shall I create a JIRA issue and PR for this abstraction?
> I think reviewing on the interface code will be clearer.
>
> Regards,
> Jacky
>
> > 在 2016年10月3日，下午2:38，Aniket Adnaik [via Apache CarbonData Mailing List
> archive] <[hidden email]> 写道：
> >
> > I would agree with having simple segment definition. Segment can use a
> > metadata info that describes the segment - For example; Segment type,
> index
> > availability, index type, index storage type (attached or
> > detached/secondary) etc. For streaming ingest segment, it also may
> possibly
> > contain min-max kind of information for each blocklet, that can used for
> > indexing.
> > So implementation details of different segment types can be hidden from
> > user.
> > We may have to think about partitioning support along with load segments
> in
> > future.
> >
> > Best Regards,
> > Aniket
> >
> >
> >
> > On Sun, Oct 2, 2016 at 10:25 PM, Jacky Li <[hidden email]
> <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=0>> wrote:
> >
> > > After a second thought regarding the index part, another option is
> that to
> > > have a very simple Segment definition which can only list all files it
> has
> > > or listFile taking the QueryModel as input, implementation of Segment
> can
> > > be IndexSegment, MultiIndexSegment or StreamingSegment (no index). In
> > > future, developer is free to create MultiIndexSegment to select index
> > > internally. Is this option better?
> > >
> > > Regards,
> > > Jacky
> > >
> > > > 在 2016年10月3日，上午11:00，Jacky Li <[hidden email]
> <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=1>> 写道：
> > > >
> > > > I am currently thinking these abstractions:
> > > >
> > > > - A SegmentManager is the global manager of all segments for one
> table.
> > > It can be used to get all segments and manage the segment while
> loading and
> > > compaction.
> > > > - A CarbonInputFormat will take the input of table path, so means it
> > > represent the whole table contain all segments. When getSplit is
> called,
> > > it will get all segments by calling SegmentManager interface.
> > > > - Each Segment contains a list of Index, and an IndexSelector. While
> > > currently CarbonData only has MDK index, developer can create multiple
> > > indices for each segment in the future.
> > > > - An Index is an interface to filtering on block/blocklet, and
> provide
> > > this functionality only. Implementation should hide all complexity
> like
> > > deciding where to store the index.
> > > > - An IndexSelector is an interface to choose which index to use
> based on
> > > query predicates. Default implementation is to choose the first index.
> An
> > > implementation of IndexChooser can also decide not to use index at all.
> > > > - A Distributor is used to map the filtered block/blocklet to
> > > InputSplits. Implementation can take number of node, parallelism into
> > > consideration. It can also decide to distribute tasks based on block or
> > > blocklet.
> > > >
> > > > So the main concepts are SegmentManager, Segment, Index,
> IndexSelector,
> > > InputFormat/OutputFormat, Distributor.
> > > >
> > > > There will be a default implementation of CarbonInputFormat whose
> > > getSplit will do the following:
> > > > 1. gat all segments by calling SegmentManager
> > > > 2. for each segment, choose the index to use by IndexSelector
> > > > 3. invoke the selected Index to filter out block/blocklet (since
> these
> > > are two concept, maybe a parent class need to be created to encapsulate
> > > them)
> > > > 4. distribute the filtered block/blocklet to InputSplits by
> Distributor.
> > > >
> > > > Regarding the input to the Index.filter interface, I have not
> decided to
> > > use the existing QueryModel or create a new cleaner QueryModel
> interface.
> > > If new QueryModel is desired, it should only contain filter predicate
> and
> > > project columns, so it is much simpler than current QueryModel. But I
> see
> > > current QueryModel is used in Compaction also, so I think it is better
> to
> > > do this clean up later?
> > > >
> > > >
> > > > Does this look fine to you? Any suggestion is welcome.
> > > >
> > > > Regards,
> > > > Jacky
> > > >
> > > >
> > > >> 在 2016年10月3日，上午2:18，Venkata Gollamudi <[hidden email]
> <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=2>> 写道：
> > > >>
> > > >> Yes Jacky, interfaces needs to be revisited.
> > > >> For Goal 1 and Goal 2: abstraction required for both Index and Index
> > > store.
> > > >> Also multi-column index(composite index) needs to be considered.
> > > >>
> > > >> Regards,
> > > >> Ramana
> > > >>
> > > >> On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li <[hidden email]
> <x-msg://8/user/SendEmail.jtp?type=node&node=1598&i=3>> wrote:
> > > >>
> > > >>> Hi community,
> > > >>>
> > > >>> Currently CarbonData have builtin index support which is one of
> the
> > > key
> > > >>> strength of CarbonData. Using index, CarbonData can do very fast
> filter
> > > >>> query by filtering on block and blocklet level. However, it also
> > > introduces
> > > >>> memory consumption of the index tree and impact first query time
> > > because
> > > >>> the
> > > >>> process of loading of index from file footer into memory. On the
> other
> > > >>> side,
> > > >>> in a multi-tennant environment, multiple applications may access
> data
> > > files
> > > >>> simultaneously, which again exacerbate this resource consumption
> issue.
> > > >>> So, I want to propose and discuss a solution with you to solve
> this
> > > >>> problem and make an abstraction of interface for CarbonData's
> future
> > > >>> evolvement.
> > > >>> I am thinking the final result of this work should achieve at
> least
> > > two
> > > >>> goals:
> > > >>>
> > > >>> Goal 1: User can choose the place to store Index data, it can be
> > > stored in
> > > >>> processing framework's memory space (like in spark driver memory)
> or in
> > > >>> another service outside of the processing framework (like using a
> > > >>> independent database service)
> > > >>>
> > > >>> Goal 2: Developer can add more index of his choice to CarbonData
> files.
> > > >>> Besides B+ tree on multi-dimensional key which current CarbonData
> > > supports,
> > > >>> developers are free to add other indexing technology to make
> certain
> > > >>> workload faster. These new indices should be added in a pluggable
> way.
> > > >>>
> > > >>> In order to achieve these goals, an abstraction need to be
> created
> > > for
> > > >>> CarbonData project, including:
> > > >>>
> > > >>> - Segment: each segment is presenting one load of data, and tie
> with
> > > some
> > > >>> indices created with this load
> > > >>>
> > > >>> - Index: index is created when this segment is created, and is
> > > leveraged
> > > >>> when CarbonInputFormat's getSplit is called, to filter out the
> required
> > > >>> blocks or even blocklets.
> > > >>>
> > > >>> - CarbonInputFormat: There maybe n number of indices created for
> data
> > > file,
> > > >>> when querying these data files, InputFormat should know how to
> access
> > > these
> > > >>> indices, and initialize or loading these index if required.
> > > >>>
> > > >>> Obviously, this work should be separated into different tasks and
> > > >>> implemented gradually. But first of all, let's discuss on the goal
> and
> > > the
> > > >>> proposed approach. What is your idea?
> > > >>>
> > > >>> Regards,
> > > >>> Jacky
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> View this message in context: http://apache-carbondata- <
> http://apache-carbondata-/>
> > > >>> mailing-list-archive.1130556.n5.nabble.com/Abstracting- <
> http://mailing-list-archive.1130556.n5.nabble.com/Abstracting->
> > > >>> CarbonData-s-Index-Interface-tp1587.html
> > > >>> Sent from the Apache CarbonData Mailing List archive mailing list
> > > archive
> > > >>> at Nabble.com <http://nabble.com/>.
> > > >>>
> > > >
> > >
> > >
> >
> >
> > If you reply to this email, your message will be added to the discussion
> below:
> > http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1598.html <
> http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587p1598.html>
> > To unsubscribe from Abstracting CarbonData's Index Interface, click here
> <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/template/NamlServlet.jtp?macro=
> unsubscribe_by_code&node=1587&code=amFja3kubGlrdW5AcXEuY29tfDE1OD
> d8LTEyNTA5Nzc4Mjg=>.
> > NAML <http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%
> 21nabble%3Aemail.naml&base=nabble.naml.namespaces.
> BasicNamespace-nabble.view.web.template.NabbleNamespace-
> nabble.naml.namespaces.BasicNamespace-nabble.view.
> web.template.NabbleNamespace-nabble.naml.namespaces.
> BasicNamespace-nabble.view.web.template.NabbleNamespace-
> nabble.naml.namespaces.BasicNamespace-nabble.view.
> web.template.NabbleNamespace-nabble.naml.namespaces.
> BasicNamespace-nabble.view.web.template.NabbleNamespace-
> nabble.view.web.template.NodeNamespace&breadcrumbs=
> notify_subscribers%21nabble%3Aemail.naml-instant_emails%
> 21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Abstracting-
> CarbonData-s-Index-Interface-tp1587p1599.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>

kumar vishal