http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1949.html
we should have solution to add global dictionary. So this solution should
tool for first time. That solution could be distributed map or KV store.
Ravi.
> Hi Liang,
>
> This tool works more or less like a first load: it runs once, right after the
> table is created. Any subsequent or incremental load then proceeds normally
> and can update the global dictionary when it encounters a new value. This is
> the easiest way to achieve a one-pass data loading process without too much
> overhead.
>
> Since this tool is triggered only once per table, it should not be too much
> of a burden on end users. Keeping global dictionary generation out of the way
> of regular data loading is the key here.
>
> Jihong
>
> -----Original Message-----
> From: Liang Chen [mailto:[hidden email]]
> Sent: Thursday, October 13, 2016 5:39 PM
> To: [hidden email]
> Subject: RE: Discussion(New feature) regarding single pass data loading
> solution.
>
> Hi Jihong,
>
> I am not sure users will accept an extra tool for this work. From the users'
> perspective, running a tool and doing a first-time scan per table to build
> most of the global dictionary cost the same, and so does maintaining the
> dictionary file. Users always expect the system to generate the dictionary
> file automatically and internally while loading data.
>
> Can we consider this:
> First load: scan the data to generate most of the global dictionary file,
> then copy this file to each load node for subsequent loads.
>
> Regards
> Liang
>
>
> Jihong Ma wrote
> >>>>>the question is what would be the default implementation? Load data without dictionary?
> >
> > My thought is that we can provide a tool to generate the global dictionary
> > from a sample data set, so the initial global dictionaries are available
> > before normal data loading. We can then encode based on them and only need
> > to handle occasionally adding entries while loading. For columns specified
> > with global dictionary encoding whose dictionary is not in place before
> > data loading, we error out and direct the user to run the tool first.
> >
> > Does that make sense?
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Ravindra Pesala [mailto:ravi.pesala@]
> > Sent: Thursday, October 13, 2016 1:12 AM
> > To: dev
> > Subject: Re: Discussion(New feature) regarding single pass data loading
> > solution.
> >
> > Hi Jihong/Aniket,
> >
> > The current implementation of CarbonData already handles an external
> > dictionary while loading data.
> > But the question here is: what would be the default implementation? Load
> > data without a dictionary?
> >
> >
> > Regards,
> > Ravi
> >
> > On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
> >
> >> Hi Ravi,
> >>
> >> 1. I agree with Jihong that creating the global dictionary should be
> >> optional, so that it can be disabled to improve load performance. Users
> >> should be made aware that using a global dictionary may boost query
> >> performance.
> >> 2. We should have a generic interface to manage the global dictionary when
> >> it comes from external sources. In general, it is not a good idea to
> >> depend on too many external tools.
> >> 3. Maybe we should allow the user to generate the global dictionary
> >> separately through a SQL command or similar, something like a materialized
> >> view. This means Carbon should avoid using the local dictionary and do late
> >> materialization when a global dictionary is present.
> >> 4. Maybe we should think of ways to create the global dictionary lazily
> >> as we serve SELECT queries. The implementation may not be that
> >> straightforward. Not sure if it is worth the effort.
> >>
> >> Best Regards,
> >> Aniket
> >>
> >>
> >> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
> >>
> >> >
> >> > A rather straightforward option is to allow the user to supply a global
> >> > dictionary generated somewhere else, or to build a separate tool just for
> >> > generating as well as updating the dictionary. The normal data loading
> >> > process will then encode columns with a local dictionary if a global one
> >> > is not supplied. This should cover the majority of cases for
> >> > low-to-medium cardinality columns. For the cases where we have to
> >> > incorporate online dictionary updates, a lock mechanism to synchronize
> >> > them should serve the purpose.
> >> >
> >> > In other words, generating the global dictionary is an optional step,
> >> > triggered only when needed, not a default step as it is currently.
> >> >
> >> > Jihong
> >> >
> >> > -----Original Message-----
> >> > From: Ravindra Pesala [mailto:ravi.pesala@]
> >> > Sent: Tuesday, October 11, 2016 2:33 AM
> >> > To: dev
> >> > Subject: Discussion(New feature) regarding single pass data loading
> >> > solution.
> >> >
> >> > Hi All,
> >> >
> >> > This discussion is about a single-pass data load solution.
> >> >
> >> > Currently, data is loaded into Carbon in 2 passes/jobs:
> >> > 1. Generate the global dictionary using a Spark job.
> >> > 2. Encode the data with dictionary values and create carbondata files.
> >> > This 2-pass solution has many disadvantages: it needs to read the data
> >> > twice for CSV file input, or it needs to execute the dataframe twice if
> >> > the data is loaded from a dataframe.
> >> >
> >> > To overcome these issues with 2-pass data loading, we can have single-pass
> >> > data loading. The following are the alternative solutions.
> >> >
> >> > Use local dictionary
> >> > Use a local dictionary for each carbondata file while loading data, but
> >> > it may lead to query performance degradation and a larger memory
> >> > footprint.
> >> >
> >> > Use KV store/distributed map.
> >> > *HBase/Cassandra cluster : *
> >> > Dictionary data would be stored in the KV store, and a dictionary value
> >> > would be generated if it is not already present there. We all know the
> >> > pros/cons of HBase, but the following are a few.
> >> > Pros : These are Apache licensed.
> >> > Easy to implement storing/retrieving dictionary values.
> >> > Performance needs to be evaluated.
> >> >
> >> > Cons : Need to maintain a separate cluster for the global dictionary.
> >> >
> >> > *Hazelcast distributed map : *
> >> > Dictionary data could be saved in Hazelcast's distributed concurrent hash
> >> > map. It is an in-memory map, partitioned across the nodes. We can also
> >> > maintain backups using the sync/async functionality to avoid data loss
> >> > when an instance is down. We do not need to maintain a separate cluster
> >> > for it, as it can run in the executor JVM itself.
> >> > Pros: It is Apache licensed.
> >> > No need to maintain a separate cluster, as instances can run in
> >> > executor JVMs.
> >> > Easy to implement storing/retrieving dictionary values.
> >> > It is a pure Java implementation.
> >> > There is no master/slave concept and no single point of failure.
> >> >
> >> > Cons: Performance needs to be evaluated.
> >> >
> >> > *Redis distributed map : *
> >> > It is also an in-memory map, but it is written in C, so we would need
> >> > Java client libraries to interact with Redis. A separate cluster needs to
> >> > be maintained for it. It can also partition the data.
> >> > Pros : More feature-rich than Hazelcast.
> >> > Easy to implement storing/retrieving dictionary values.
> >> > Cons : Need to maintain a separate cluster for the global dictionary.
> >> > May not be suitable for a big data stack.
> >> > It is BSD licensed (not sure whether we can use it or not).
> >> > Online performance figures say it is a little slower than Hazelcast.
> >> >
> >> > Please let me know which would be the best fit for our loading solution,
> >> > and please add any other suitable solution I may have missed.
> >> > --
> >> > Thanks & Regards,
> >> > Ravi
> >> >
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>
>
>
>
>
>
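
For illustration, the single-pass idea discussed in this thread (encode while loading, and assign a dictionary value whenever a new value is encountered, with a lock to synchronize updates) could look roughly like the sketch below. It is only a sketch under assumptions: the class `GlobalDictionary`, the method `get_or_assign` and the function `encode_column` are hypothetical names, not CarbonData APIs, and an in-process Python dict with a lock stands in for whatever KV store or distributed map would hold the dictionary in practice.

```python
import threading


class GlobalDictionary:
    """Hypothetical in-process stand-in for a shared dictionary store."""

    def __init__(self, initial=None):
        # Maps column value -> surrogate key. It may be pre-populated,
        # for example by a first load or a separate generation step.
        self._keys = dict(initial or {})
        self._next_key = max(self._keys.values(), default=0) + 1
        self._lock = threading.Lock()

    def get_or_assign(self, value):
        # Fast path: the value already has a surrogate key.
        key = self._keys.get(value)
        if key is not None:
            return key
        # Slow path: take the lock and re-check, so that concurrent
        # loaders agree on a single key for each new value.
        with self._lock:
            key = self._keys.get(value)
            if key is None:
                key = self._next_key
                self._keys[value] = key
                self._next_key += 1
            return key


def encode_column(values, dictionary):
    """Encode a column in one pass, adding new entries as they appear."""
    return [dictionary.get_or_assign(v) for v in values]


# Dictionary seeded with two known values, then two incremental loads.
d = GlobalDictionary({"beijing": 1, "shenzhen": 2})
first = encode_column(["beijing", "bangalore", "shenzhen"], d)
second = encode_column(["bangalore", "san jose"], d)
```

Here the first load reuses the existing keys and assigns a new one to "bangalore"; the second load reuses that key and adds "san jose". With a distributed map, the check-and-assign step would need an equivalent atomic operation on the remote store, and its cost is the performance question raised in the thread.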