http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1948.html
1. Using the external tool to generate the dictionary : I think it cannot
boost performance.
2. Using the 2 pass solution (current solution) : currently we have the 2
pass solution, and this becomes a bottleneck for CarbonOutputFormat; issues
also arise when we use dataframe.write().
3. Using local dictionary as default implementation : we can choose this
work.
4. Using distributed map as default implementation: Generate the global
performance.
Ravi.
> After rethinking about point 4 in my previous email: it will be very
> expensive to rebuild and re-encode the values, so it may not be a viable
> option. Only future loads can benefit from it. But then we will end up
> having some segments using the global dictionary and some using the local
> dictionary. Maybe we should not consider this option.
>
> Best Regards,
> Aniket
>
>
> On Thu, Oct 13, 2016 at 5:54 PM, Aniket Adnaik <[hidden email]> wrote:
>
> > I have the following comments:
> >
> > 1. If an external dictionary is provided, we accept it. This interface
> > should be generic enough that we can perform lookup, add, delete, create
> > and drop functionality (see the sketch after this list). I believe we
> > already have this functionality to some extent. As long as we are able
> > to maintain the dictionary it should be fine.
> > 2. If an external dictionary is not provided, then by default we should
> > build it internally, which is our current behavior. This will continue
> > to impact the load performance though.
> > 3. If load performance is not acceptable, we should allow the user to
> > disable building of the global dictionary. Carbon should build a local
> > dictionary instead. Will this setting apply to all subsequent loads?
> > Maybe yes for now.
> > 4. If the user decides to build the dictionary at a later point, either
> > via an external tool or using a carbon SQL command ("CREATE DICTIONARY
> > TABLE..."), we should provide that facility. This will help the user to
> > improve query performance through late materialization. The local
> > dictionary will not be used in this case. Subsequent loads will continue
> > to add new entries to this new dictionary (external or carbon specific).
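[Editor's illustration] A minimal sketch of what such a generic dictionary
management interface could look like. The names below are hypothetical and
only meant for discussion; they are not CarbonData's actual classes.

    // Hypothetical interface sketch, not CarbonData's real API.
    trait DictionaryManager {
      def create(table: String, column: String): Unit
      def drop(table: String, column: String): Unit
      // value -> surrogate key, None if the value is not in the dictionary
      def lookup(table: String, column: String, value: String): Option[Int]
      // bulk add during load; returns the keys assigned to the new values
      def add(table: String, column: String, values: Seq[String]): Map[String, Int]
      def delete(table: String, column: String, value: String): Unit
    }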
> >
> > This doesn't really solve our double pass problem, but kind of works
> > around it by isolating the dictionary building operation out of the
> > critical path.
> >
> >
> > Best Regards,
> > Aniket
> >
> >
> > On Thu, Oct 13, 2016 at 5:39 PM, Liang Chen <[hidden email]> wrote:
> >
> >> Hi jihong
> >>
> >> I am not sure that users will accept using an extra tool to do this
> >> work, because providing a tool or doing a first-time scan per table to
> >> build most of the global dict is the same cost from the user's
> >> perspective, and maintaining the dict file is also the same cost. They
> >> always expect that the system can automatically and internally generate
> >> the dict file during data loading.
> >>
> >> Can we consider this:
> >> first load: scan to generate most of the global dict file, then copy
> >> this file to each load node for subsequent loads.
> >>
> >> Regards
> >> Liang
> >>
> >>
> >> Jihong Ma wrote
> >> >>>>>the question is what would be the default implementation? Load data
> >> without dictionary?
> >> >
> >> > My thought is we can provide a tool to generate the global dictionary
> >> > using a sample data set, so the initial global dictionaries are
> >> > available before normal data loading. We shall be able to perform
> >> > encoding based on that; we only need to handle occasional additions of
> >> > entries while loading. For columns specified with global dictionary
> >> > encoding whose dictionary is not in place before data loading, we
> >> > error out and direct the user to use the tool first.
> >> >
> >> > Make sense?
> >> >
> >> > Jihong
> >> >
> >> > -----Original Message-----
> >> > From: Ravindra Pesala [mailto:ravi.pesala@]
> >> > Sent: Thursday, October 13, 2016 1:12 AM
> >> > To: dev
> >> > Subject: Re: Discussion(New feature) regarding single pass data
> >> > loading solution.
> >> >
> >> > Hi Jihong/Aniket,
> >> >
> >> > In the current implementation of carbondata we are already handling
> >> > an external dictionary while loading the data.
> >> > But here the question is what would be the default implementation?
> >> > Load data without a dictionary?
> >> >
> >> >
> >> > Regards,
> >> > Ravi
> >> >
> >> > On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
> >> >
> >> >> Hi Ravi,
> >> >>
> >> >> 1. I agree with Jihong that creation of the global dictionary should
> >> >> be optional, so that it can be disabled to improve the load
> >> >> performance. The user should be made aware that using a global
> >> >> dictionary may boost query performance.
> >> >> 2. We should have a generic interface to manage the global dictionary
> >> >> when it comes from external sources. In general, it is not a good
> >> >> idea to depend on too many external tools.
> >> >> 3. Maybe we should allow the user to generate the global dictionary
> >> >> separately through a SQL command or similar. Something like a
> >> >> materialized view. This means carbon should avoid using the local
> >> >> dictionary and do late materialization when a global dictionary is
> >> >> present.
> >> >> 4. Maybe we should think of some ways to create the global dictionary
> >> >> lazily as we serve SELECT queries. Implementation may not be that
> >> >> straightforward. Not sure if it's worth the effort.
> >> >>
> >> >> Best Regards,
> >> >> Aniket
> >> >>
> >> >>
> >> >> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
> >> >>
> >> >> >
> >> >> > A rather straightforward option is to allow the user to supply a
> >> >> > global dictionary generated somewhere else, or we build a separate
> >> >> > tool just for generating as well as updating the dictionary. Then
> >> >> > the general normal data loading process will encode columns with a
> >> >> > local dictionary if one is not supplied. This should cover the
> >> >> > majority of cases for low-to-medium cardinality columns. For the
> >> >> > cases where we have to incorporate online dictionary updates, using
> >> >> > a lock mechanism to sync up should serve the purpose.
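[Editor's illustration] To make the "occasional addition under a lock" idea
concrete, a minimal per-JVM sketch (hypothetical names; a real loader would
need the miss path synchronized across tasks, for example via one of the
distributed maps discussed later in the thread):

    import java.util.concurrent.ConcurrentHashMap
    import java.util.concurrent.atomic.AtomicInteger

    // dictionary pre-built by the external tool; misses should be rare
    val dict  = new ConcurrentHashMap[String, Integer]()
    val maxId = new AtomicInteger(dict.size())

    def encode(value: String): Int = {
      val existing = dict.get(value)
      if (existing != null) existing.intValue()
      else dict.synchronized {               // lock only on the rare miss
        val again = dict.get(value)
        if (again != null) again.intValue()
        else {
          val id = maxId.incrementAndGet()   // assign the next surrogate key
          dict.put(value, id)
          id
        }
      }
    }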
> >> >> >
> >> >> > In other words, generating the global dictionary is an optional
> >> >> > step, only triggered when needed, not a default step as we do
> >> >> > currently.
> >> >> >
> >> >> > Jihong
> >> >> >
> >> >> > -----Original Message-----
> >> >> > From: Ravindra Pesala [mailto:ravi.pesala@]
> >> >> > Sent: Tuesday, October 11, 2016 2:33 AM
> >> >> > To: dev
> >> >> > Subject: Discussion(New feature) regarding single pass data loading
> >> >> > solution.
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > This discussion is regarding a single pass data load solution.
> >> >> >
> >> >> > Currently data is loaded into carbon in 2 passes/jobs:
> >> >> > 1. Generate the global dictionary using a spark job.
> >> >> > 2. Encode the data with dictionary values and create carbondata
> >> >> > files.
> >> >> > This 2 pass solution has many disadvantages: it needs to read the
> >> >> > data twice in the case of csv file input, or it needs to execute
> >> >> > the dataframe twice if data is loaded from a dataframe.
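[Editor's illustration] A minimal spark sketch of the two passes, assuming an
existing SparkSession `spark`, a made-up csv path and a single dictionary
column "country"; this is only conceptual, not CarbonData's actual loading
code:

    // Pass 1: a spark job that builds the global dictionary for one column
    val df = spark.read.option("header", "true").csv("/data/sales.csv")
    val dictionary: Map[String, Int] =
      df.select("country").distinct().collect()
        .map(_.getString(0)).sorted.zipWithIndex
        .map { case (value, idx) => (value, idx + 1) }.toMap  // keys start at 1

    // Pass 2: evaluate the same input again and encode it with the dictionary
    val dictBroadcast = spark.sparkContext.broadcast(dictionary)
    val encoded = df.select("country").rdd
      .map(row => dictBroadcast.value(row.getString(0)))      // string -> key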
> >> >> >
> >> >> > In order to overcome the above issues of 2 pass data loading, we
> >> >> > can have single pass data loading, and the following are the
> >> >> > alternative solutions.
> >> >> >
> >> >> > Use local dictionary
> >> >> > Use a local dictionary for each carbondata file while loading data,
> >> >> > but it may lead to query performance degradation and a larger
> >> >> > memory footprint.
> >> >> >
> >> >> > Use KV store/distributed map.
> >> >> > *HBase/Cassandra cluster : *
> >> >> > Dictionary data would be stored in the KV store, and a dictionary
> >> >> > value is generated if it is not already present. We all know the
> >> >> > pros/cons of HBase, but following are a few.
> >> >> > Pros : These are apache licensed.
> >> >> >        Easy to implement store/retrieve of dictionary values.
> >> >> >        Performance needs to be evaluated.
> >> >> >
> >> >> > Cons : Need to maintain a separate cluster for the global
> >> >> > dictionary.
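[Editor's illustration] A hedged sketch of the "generate if not present"
lookup against such a KV store, using the HBase 1.x client; the table name,
column family and row-key layout are made up:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("carbon_global_dict"))
    val (cf, q) = (Bytes.toBytes("d"), Bytes.toBytes("id"))

    // returns the surrogate key for value, generating one if it is missing
    def getOrAssign(column: String, value: String): Long = {
      val rowKey = Bytes.toBytes(column + ":" + value)
      val found  = table.get(new Get(rowKey))
      if (!found.isEmpty) return Bytes.toLong(found.getValue(cf, q))
      // allocate the next key from a per-column counter row (atomic increment)
      val candidate =
        table.incrementColumnValue(Bytes.toBytes(column + ":__max__"), cf, q, 1L)
      val put = new Put(rowKey).addColumn(cf, q, Bytes.toBytes(candidate))
      // claim the value only if no other loader has; otherwise reuse the winner's key
      if (table.checkAndPut(rowKey, cf, q, null, put)) candidate
      else Bytes.toLong(table.get(new Get(rowKey)).getValue(cf, q))
    }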
> >> >> >
> >> >> > *Hazelcast distributed map : *
> >> >> > Dictionary data could be saved in the distributed concurrent hash
> >> >> > map of Hazelcast. It is an in-memory map, partitioned as per the
> >> >> > number of nodes. We can even maintain backups using sync/async
> >> >> > functionality to avoid data loss when an instance goes down. We do
> >> >> > not need to maintain a separate cluster for it, as it can run on
> >> >> > the executor jvm itself.
> >> >> > Pros: It is apache licensed.
> >> >> >           No need to maintain a separate cluster as instances can
> >> >> >           run in executor jvms.
> >> >> >           Easy to implement and store/retrieve dictionary values.
> >> >> >           It is a pure java implementation.
> >> >> >           There is no master/slave concept and no single point of
> >> >> >           failure.
> >> >> >
> >> >> > Cons: Performance needs to be evaluated.
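[Editor's illustration] A hedged sketch of the get-or-assign flow on such a
map with an embedded Hazelcast instance (3.x-era API; map and counter names
are made up):

    import com.hazelcast.core.Hazelcast

    val hz      = Hazelcast.newHazelcastInstance()        // embedded in the executor jvm
    val dict    = hz.getMap[String, java.lang.Long]("dict_country")
    val counter = hz.getAtomicLong("dict_country_max")    // cluster-wide key generator

    def getOrAssign(value: String): Long = {
      val existing = dict.get(value)
      if (existing != null) return existing.longValue()
      val candidate = counter.incrementAndGet()
      val previous  = dict.putIfAbsent(value, candidate)  // only one loader wins per value
      if (previous == null) candidate else previous.longValue()
    }

A lost race only burns one counter value, which shows up as a harmless gap in
the surrogate keys.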
> >> >> >
> >> >> > *Redis distributed map : *
> >> >> > It is also an in-memory map, but it is written in C, so we would
> >> >> > need java client libraries to interact with redis. We need to
> >> >> > maintain a separate cluster for it. It can also partition the data.
> >> >> > Pros : More feature rich than Hazelcast.
> >> >> >        Easy to implement and store/retrieve dictionary values.
> >> >> > Cons : Need to maintain a separate cluster for the global
> >> >> >        dictionary.
> >> >> >        May not be suitable for the big data stack.
> >> >> >        It is BSD licensed (not sure whether we can use it or not).
> >> >> >        Online performance figures say it is a little slower than
> >> >> >        Hazelcast.
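[Editor's illustration] The same pattern through the Jedis java client, again
only a sketch with made-up host and key names:

    import redis.clients.jedis.Jedis

    val jedis = new Jedis("redis-host", 6379)

    def getOrAssign(column: String, value: String): Long = {
      val hashKey  = "dict:" + column
      val existing = jedis.hget(hashKey, value)
      if (existing != null) return existing.toLong
      val candidate = jedis.incr(hashKey + ":max")   // atomic counter for new keys
      // HSETNX returns 1 only for the first writer; losers reuse the winner's key
      if (jedis.hsetnx(hashKey, value, candidate.toString).longValue() == 1L)
        candidate.longValue()
      else
        jedis.hget(hashKey, value).toLong
    }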
> >> >> >
> >> >> > Please let me know which would be the best fit for our loading
> >> >> > solution, and please add any other suitable solution if I missed
> >> >> > any.
> >> >> > --
> >> >> > Thanks & Regards,
> >> >> > Ravi
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks & Regards,
> >> > Ravi
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1887.html
> >> Sent from the Apache CarbonData Mailing List archive mailing list
> >> archive at Nabble.com.
> >>
> >
> >
>