Apache CarbonData Dev Mailing List archive - Re: Discussion(New feature) regarding single pass data loading solution.

Posted by ravipesala on Oct 14, 2016; 5:53pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1948.html

Hi,

1. Using an external tool to generate the dictionary: I think this cannot
be the default solution. It is just one option for users who are willing to
generate the dictionary separately and provide it to Carbon while loading
the data, to boost performance.

2. Using the 2-pass solution (current solution): Currently we have a 2-pass
solution, and this becomes a bottleneck for CarbonOutputFormat. Issues also
arise when we use dataframe.write().

3. Using a local dictionary as the default implementation: we can choose
this solution, but it hurts query performance because late dictionary
decoding cannot work.

4. Using a distributed map as the default implementation: generate the
global dictionary using the distributed map solution, but we need to
evaluate loading performance.
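
To make the trade-off between options 3 and 4 concrete, here is a minimal Python sketch. It is not CarbonData code: a lock-guarded in-process dict stands in for a distributed map, and the names SharedDictionary, encode_with_local_dictionary and encode_with_shared_dictionary are hypothetical. It only illustrates that per-file local dictionaries can give the same value different codes, while a shared get-or-assign map yields one code per value in a single pass.

```python
import threading


class SharedDictionary:
    """Stand-in for a distributed map: value -> surrogate key (hypothetical)."""

    def __init__(self):
        self._codes = {}
        self._lock = threading.Lock()

    def get_or_assign(self, value):
        # Atomically return the existing code, or assign the next free one.
        with self._lock:
            code = self._codes.get(value)
            if code is None:
                code = len(self._codes) + 1
                self._codes[value] = code
            return code


def encode_with_local_dictionary(rows):
    """Encode one file with its own dictionary, built while reading the rows."""
    local = {}
    encoded = [local.setdefault(value, len(local) + 1) for value in rows]
    return encoded, local


def encode_with_shared_dictionary(rows, shared):
    """Encode one file in a single pass against the shared dictionary."""
    return [shared.get_or_assign(value) for value in rows]


# Two illustrative input files for the same column (made-up values).
file_a = ["delhi", "pune", "delhi"]
file_b = ["pune", "delhi", "goa"]

# Local dictionaries: each file numbers its values independently.
enc_a, dict_a = encode_with_local_dictionary(file_a)
enc_b, dict_b = encode_with_local_dictionary(file_b)

# Shared dictionary: both files agree on the code for every value.
shared = SharedDictionary()
shared_a = encode_with_shared_dictionary(file_a, shared)
shared_b = encode_with_shared_dictionary(file_b, shared)
```

In this sketch the codes from the shared map are comparable across files, which is presumably what lets a query operate on codes and decode values late; with local dictionaries, each file's codes would have to be decoded before values from different files can be compared.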

Regards,
Ravi.

On 14 October 2016 at 06:32, Aniket Adnaik <[hidden email]> wrote:

> After rethinking point 4 in my previous email:
> it will be very expensive to rebuild and re-encode the values, so it may not
> be a viable option. Only future loads can benefit from it. But then we will
> end up having some segments using a global dictionary and some using a local
> dictionary. Maybe we should not consider this option.
>
> Best Regards,
> Aniket
>
>
> On Thu, Oct 13, 2016 at 5:54 PM, Aniket Adnaik <[hidden email]>
> wrote:
>
> > I have the following comments:
> >
> > 1. If an external dictionary is provided, we accept it. This interface
> > should be generic enough that we can perform lookup, add, delete, create
> > and drop functionality. I believe we already have this functionality to
> > some extent. As long as we are able to maintain the dictionary, it should
> > be fine.
> > 2. If an external dictionary is not provided, then by default we should
> > build it internally, which is our current behavior. This will continue to
> > impact the load performance, though.
> > 3. If load performance is not acceptable, we should allow the user to
> > disable building of the global dictionary. Carbon should build a local
> > dictionary instead. Will this setting apply to all subsequent loads? Maybe
> > yes, for now.
> > 4. If the user decides to build the dictionary at a later point, either via
> > an external tool or using a Carbon SQL command ("CREATE DICTIONARY TABLE..."),
> > we should provide that facility. This will help the user improve query
> > performance through late materialization. The local dictionary will not be
> > used in this case. Subsequent loads will continue to add new entries to
> > this new dictionary (external or Carbon-specific).
> >
> > This doesn't really solve our double-pass problem, but kind of works
> > around it by isolating the dictionary-building operation out of the
> > critical path.
> >
> >
> > Best Regards,
> > Aniket
> >
> >
> > On Thu, Oct 13, 2016 at 5:39 PM, Liang Chen <[hidden email]>
> > wrote:
> >
> >> Hi Jihong,
> >>
> >> I am not sure that users will accept using an extra tool to do this work.
> >> From the users' perspective, providing a tool and doing a scan on the first
> >> load per table to build most of the global dictionary cost the same, and
> >> maintaining the dictionary file costs the same too. Users always expect
> >> the system to generate the dictionary file automatically and internally
> >> while loading data.
> >>
> >> Can we consider this:
> >> First load: scan to generate most of the global dictionary file, then copy
> >> this file to each load node for subsequent loading.
> >>
> >> Regards
> >> Liang
> >>
> >>
> >> Jihong Ma wrote
> >> > >>>>the question is what would be the default implementation? Load data without dictionary?
> >> >
> >> > My thought is that we can provide a tool to generate the global dictionary
> >> > using a sample data set, so the initial global dictionaries are available
> >> > before normal data loading. We should be able to perform encoding based on
> >> > that, and only need to handle occasionally adding entries while loading.
> >> > For columns specified with global dictionary encoding where the dictionary
> >> > is not in place before data loading, we error out and direct the user to
> >> > use the tool first.
> >> >
> >> > Make sense?
> >> >
> >> > Jihong
> >> >
> >> > -----Original Message-----
> >> > From: Ravindra Pesala [mailto:ravi.pesala@]
> >> > Sent: Thursday, October 13, 2016 1:12 AM
> >> > To: dev
> >> > Subject: Re: Discussion(New feature) regarding single pass data loading
> >> > solution.
> >> >
> >> > Hi Jihong/Aniket,
> >> >
> >> > In the current implementation of CarbonData we already handle an
> >> > external dictionary while loading the data.
> >> > But the question here is: what would be the default implementation? Load
> >> > data without a dictionary?
> >> >
> >> >
> >> > Regards,
> >> > Ravi
> >> >
> >> > On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
> >> >
> >> >> Hi Ravi,
> >> >>
> >> >> 1. I agree with Jihong that creation of the global dictionary should be
> >> >> optional, so that it can be disabled to improve load performance. The
> >> >> user should be made aware that using a global dictionary may boost query
> >> >> performance.
> >> >> 2. We should have a generic interface to manage the global dictionary
> >> >> when it's from external sources. In general, it is not a good idea to
> >> >> depend on too many external tools.
> >> >> 3. Maybe we should allow the user to generate the global dictionary
> >> >> separately through a SQL command or similar, something like a
> >> >> materialized view. This means Carbon should avoid using the local
> >> >> dictionary and do late materialization when a global dictionary is
> >> >> present.
> >> >> 4. Maybe we should think of some ways to create the global dictionary
> >> >> lazily as we serve SELECT queries. The implementation may not be that
> >> >> straightforward. Not sure if it's worth the effort.
> >> >>
> >> >> Best Regards,
> >> >> Aniket
> >> >>
> >> >>
> >> >> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
> >> >>
> >> >> >
> >> >> > A rather straightforward option is to allow the user to supply a global
> >> >> > dictionary generated somewhere else, or we build a separate tool just
> >> >> > for generating as well as updating the dictionary. Then the normal data
> >> >> > loading process will encode columns with a local dictionary if one is
> >> >> > not supplied. This should cover the majority of cases for low-to-medium
> >> >> > cardinality columns. For the cases where we have to incorporate online
> >> >> > dictionary updates, using a lock mechanism to sync up should serve the
> >> >> > purpose.
> >> >> >
> >> >> > In other words, generating the global dictionary is an optional step,
> >> >> > only triggered when needed, not a default step as we do currently.
> >> >> >
> >> >> > Jihong
> >> >> >
> >> >> > -----Original Message-----
> >> >> > From: Ravindra Pesala [mailto:ravi.pesala@]
> >> >> > Sent: Tuesday, October 11, 2016 2:33 AM
> >> >> > To: dev
> >> >> > Subject: Discussion(New feature) regarding single pass data loading
> >> >> > solution.
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > This discussion is regarding a single-pass data load solution.
> >> >> >
> >> >> > Currently data is loaded into Carbon in 2 passes/jobs:
> >> >> > 1. Generate the global dictionary using a Spark job.
> >> >> > 2. Encode the data with dictionary values and create carbondata files.
> >> >> > This 2-pass solution has many disadvantages: it needs to read the data
> >> >> > twice in the case of CSV file input, or it needs to execute the
> >> >> > dataframe twice if data is loaded from a dataframe.
> >> >> >
> >> >> > To overcome the above issues of 2-pass data loading, we can have
> >> >> > single-pass data loading. The following are the alternative solutions.
> >> >> >
> >> >> > Use local dictionary
> >> >> > Use a local dictionary for each carbondata file while loading data, but
> >> >> > it may lead to query performance degradation and a larger memory
> >> >> > footprint.
> >> >> >
> >> >> > Use KV store/distributed map.
> >> >> > *HBase/Cassandra cluster : *
> >> >> > Dictionary data would be stored in the KV store, and a dictionary value
> >> >> > is generated if it is not already present. We all know the pros/cons of
> >> >> > HBase, but the following are a few.
> >> >> > Pros : These are Apache licensed.
> >> >> >        Easy to implement storing/retrieving dictionary values.
> >> >> >        Performance needs to be evaluated.
> >> >> >
> >> >> > Cons : Need to maintain a separate cluster for maintaining the global
> >> >> >        dictionary.
> >> >> >
> >> >> > *Hazelcast distributed map : *
> >> >> > Dictionary data could be saved in Hazelcast's distributed concurrent
> >> >> > hash map. It is an in-memory map, partitioned according to the number
> >> >> > of nodes. We can even maintain backups using the sync/async
> >> >> > functionality to avoid data loss when an instance is down. We do not
> >> >> > need to maintain a separate cluster for it, as it can run in the
> >> >> > executor JVM itself.
> >> >> > Pros: It is Apache licensed.
> >> >> >       No need to maintain a separate cluster, as instances can run in
> >> >> >       executor JVMs.
> >> >> >       Easy to implement and store/retrieve dictionary values.
> >> >> >       It is a pure Java implementation.
> >> >> >       There is no master/slave concept and no single point of failure.
> >> >> >
> >> >> > Cons: Performance needs to be evaluated.
> >> >> >
> >> >> > *Redis distributed map : *
> >> >> > It is also an in-memory map, but it is written in C, so we would need
> >> >> > Java client libraries to interact with Redis. We need to maintain a
> >> >> > separate cluster for it. It can also partition the data.
> >> >> > Pros : More feature rich than Hazelcast.
> >> >> >        Easy to implement and store/retrieve dictionary values.
> >> >> > Cons : Need to maintain a separate cluster for maintaining the global
> >> >> >        dictionary.
> >> >> >        May not be suitable for the big data stack.
> >> >> >        It is BSD licensed (not sure whether we can use it or not).
> >> >> >        Online performance figures say it is a little slower than
> >> >> >        Hazelcast.
> >> >> >
> >> >> > Please let me know which would be the best fit for our loading
> >> >> > solution. And please add any other suitable solution if I missed one.
> >> >> > --
> >> >> > Thanks & Regards,
> >> >> > Ravi
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks & Regards,
> >> > Ravi
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>

--
Thanks & Regards,
Ravi
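
For reference, the difference between the 2-pass load described in the original proposal quoted above and a single-pass load against a pre-seeded dictionary (as Jihong suggested) can be sketched roughly as follows. This is an illustrative Python sketch only, not how CarbonData implements loading; CountingSource, two_pass_load and single_pass_load are hypothetical names, and a plain dict stands in for the dictionary store.

```python
class CountingSource:
    """Hypothetical input source that counts how often it is fully scanned."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def scan(self):
        self.reads += 1
        return iter(self._rows)


def two_pass_load(source):
    # Pass 1: scan the input once just to build the dictionary.
    dictionary = {}
    for value in source.scan():
        dictionary.setdefault(value, len(dictionary) + 1)
    # Pass 2: scan the input again to encode it with the finished dictionary.
    encoded = [dictionary[value] for value in source.scan()]
    return encoded, dictionary


def single_pass_load(source, dictionary):
    # Encode while reading; unseen values get the next free code on the fly,
    # so the input is scanned only once. The dictionary is updated in place.
    return [dictionary.setdefault(value, len(dictionary) + 1)
            for value in source.scan()]


rows = ["delhi", "pune", "delhi"]  # made-up column values

two_pass_source = CountingSource(rows)
two_pass_encoded, built = two_pass_load(two_pass_source)

# Dictionary pre-seeded from a sample; "pune" is added during the load.
seeded = {"delhi": 1}
single_pass_source = CountingSource(rows)
single_pass_encoded = single_pass_load(single_pass_source, seeded)
```

Here two_pass_source.reads ends up at 2 and single_pass_source.reads at 1, which is the cost the thread is trying to remove. In a real distributed load the on-the-fly assignment would need the locking or distributed-map coordination discussed above; this sketch ignores concurrency.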