http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1866.html
external dictionary while loading the data.
> Hi Ravi,
>
> 1. I agree with Jihong that creation of the global dictionary should be
> optional, so that it can be disabled to improve the load performance. The
> user should be made aware that using a global dictionary may boost query
> performance.
> 2. We should have a generic interface to manage the global dictionary when
> it comes from external sources. In general, it is not a good idea to
> depend on too many external tools.
> 3. Maybe we should allow the user to generate the global dictionary
> separately through a SQL command or similar, something like a materialized
> view. This means carbon should avoid using the local dictionary and do
> late materialization when a global dictionary is present.
> 4. Maybe we should think of some ways to create the global dictionary
> lazily as we serve SELECT queries. Implementation may not be that
> straightforward. Not sure if it's worth the effort.
>
> Best Regards,
> Aniket
>
>
> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <
[hidden email]> wrote:
>
> >
> > A rather straightforward option is to allow the user to supply a global
> > dictionary generated somewhere else, or to build a separate tool just
> > for generating as well as updating the dictionary. The normal data
> > loading process would then encode columns with a local dictionary if
> > none is supplied. This should cover the majority of cases for
> > low-to-medium cardinality columns. For the cases where we have to
> > incorporate online dictionary updates, using a lock mechanism to sync up
> > should serve the purpose.
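> >
> > A minimal sketch of the lock-based sync-up (illustrative only; shown as
> > an in-process lock, whereas concurrent loaders would need a
> > cluster-level lock such as a ZooKeeper or HDFS lock file):
> >
> > import java.util.HashMap;
> > import java.util.Map;
> > import java.util.concurrent.locks.ReentrantLock;
> >
> > // Illustrative, not CarbonData's dictionary writer: serialize the
> > // "look up, else generate" step so that concurrent loads updating the
> > // same global dictionary stay consistent.
> > public class LockedGlobalDictionary {
> >   private final Map<String, Integer> valueToKey = new HashMap<>();
> >   private final ReentrantLock lock = new ReentrantLock();
> >
> >   public int getOrAdd(String columnValue) {
> >     lock.lock();  // sync up concurrent loaders before assigning a new key
> >     try {
> >       return valueToKey.computeIfAbsent(columnValue, v -> valueToKey.size());
> >     } finally {
> >       lock.unlock();
> >     }
> >   }
> > }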
> >
> > In other words, generating the global dictionary is an optional step,
> > only triggered when needed, not a default step as it is currently.
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Ravindra Pesala [mailto:
[hidden email]]
> > Sent: Tuesday, October 11, 2016 2:33 AM
> > To: dev
> > Subject: Discussion(New feature) regarding single pass data loading
> > solution.
> >
> > Hi All,
> >
> > This discussion is regarding the single-pass data load solution.
> >
> > Currently data is loaded into carbon in 2 passes/jobs:
> > 1. Generate the global dictionary using a spark job.
> > 2. Encode the data with dictionary values and create carbondata files.
> > This 2-pass solution has many disadvantages: it needs to read the data
> > twice in case of csv file input, or it needs to execute the dataframe
> > twice if data is loaded from a dataframe.
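> >
> > A minimal sketch of the current 2-pass flow (Spark Java API; the input
> > path and the single dictionary column "city" are illustrative):
> >
> > import java.util.HashMap;
> > import java.util.List;
> > import java.util.Map;
> > import org.apache.spark.sql.Dataset;
> > import org.apache.spark.sql.Row;
> > import org.apache.spark.sql.SparkSession;
> >
> > public class TwoPassLoadSketch {
> >   public static void main(String[] args) {
> >     SparkSession spark =
> >         SparkSession.builder().appName("two-pass-load").getOrCreate();
> >
> >     // Pass 1: scan the csv once only to collect distinct values and
> >     // assign global surrogate keys for the dictionary column.
> >     Dataset<Row> input =
> >         spark.read().option("header", "true").csv("/tmp/input.csv");
> >     Map<String, Integer> globalDict = new HashMap<>();
> >     List<Row> distinctValues = input.select("city").distinct().collectAsList();
> >     for (Row r : distinctValues) {
> >       globalDict.put(r.getString(0), globalDict.size());
> >     }
> >
> >     // Pass 2: scan the same csv again to encode rows with the dictionary
> >     // and write carbondata files -- this second read is what we want to
> >     // avoid.
> >     Dataset<Row> reread =
> >         spark.read().option("header", "true").csv("/tmp/input.csv");
> >     // ... encode "city" through globalDict and write the columnar files ...
> >
> >     spark.stop();
> >   }
> > }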
> >
> > In order to overcome the above issues of 2-pass data loading, we can do
> > single-pass data loading; the following are the alternative solutions.
> >
> > Use local dictionary
> > Use a local dictionary for each carbondata file while loading data, but
> > it may lead to query performance degradation and a larger memory
> > footprint.
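> >
> > A minimal sketch of the idea (illustrative only): each carbondata file
> > assigns its own surrogate keys, so no coordination is needed during
> > load, but the keys are only meaningful within that file.
> >
> > import java.util.HashMap;
> > import java.util.Map;
> >
> > // Illustrative per-file local dictionary: surrogate keys start at 0 for
> > // every new file and are valid only within that file.
> > public class LocalDictionary {
> >   private final Map<String, Integer> valueToKey = new HashMap<>();
> >
> >   public int encode(String columnValue) {
> >     return valueToKey.computeIfAbsent(columnValue, v -> valueToKey.size());
> >   }
> >
> >   public int size() {
> >     return valueToKey.size();
> >   }
> > }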
> >
> > Use KV store/distributed map.
> > *HBase/Cassandra cluster : *
> > Dictionary data would be stored in the KV store, and the dictionary
> > value is generated if it is not already present in it (a sketch follows
> > the pros/cons below). We all know the pros/cons of HBase, but following
> > are a few.
> > Pros : These are Apache licensed.
> >        Easy to implement for storing/retrieving dictionary values.
> >        Performance needs to be evaluated.
> >
> > Cons : Need to maintain a separate cluster for maintaining the global
> >        dictionary.
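> >
> > A minimal sketch of the lookup/generate step against the HBase 1.x
> > client API (table and column names are illustrative):
> >
> > import org.apache.hadoop.hbase.TableName;
> > import org.apache.hadoop.hbase.client.Connection;
> > import org.apache.hadoop.hbase.client.Get;
> > import org.apache.hadoop.hbase.client.Put;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.Table;
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class HBaseDictionaryClient {
> >   private static final byte[] CF = Bytes.toBytes("d");
> >   private static final byte[] QUAL = Bytes.toBytes("key");
> >   private static final byte[] COUNTER_ROW = Bytes.toBytes("!counter");
> >   private final Table table;
> >
> >   public HBaseDictionaryClient(Connection conn, String columnName)
> >       throws Exception {
> >     this.table = conn.getTable(TableName.valueOf("dict_" + columnName));
> >   }
> >
> >   /** Returns the surrogate key for columnValue, generating one if absent. */
> >   public long getOrGenerate(String columnValue) throws Exception {
> >     byte[] row = Bytes.toBytes(columnValue);
> >     Result existing = table.get(new Get(row));
> >     if (!existing.isEmpty()) {
> >       return Bytes.toLong(existing.getValue(CF, QUAL));
> >     }
> >     // Draw the next key from an atomic counter row, then try to claim it.
> >     long candidate = table.incrementColumnValue(COUNTER_ROW, CF, QUAL, 1L);
> >     Put put = new Put(row).addColumn(CF, QUAL, Bytes.toBytes(candidate));
> >     boolean won = table.checkAndPut(row, CF, QUAL, null, put);
> >     if (won) {
> >       return candidate;
> >     }
> >     // Another loader inserted this value first; reuse its key.
> >     return Bytes.toLong(table.get(new Get(row)).getValue(CF, QUAL));
> >   }
> > }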
> >
> > *Hazelcast distributed map : *
> > Dictionary data could be saved in the distributed concurrent hash map
> > of Hazelcast. It is an in-memory map, partitioned as per the number of
> > nodes. We can even maintain backups using its sync/async functionality
> > to avoid data loss when an instance goes down. We do not need to
> > maintain a separate cluster for it as it can run inside the executor
> > JVMs themselves (a sketch follows the pros/cons below).
> > Pros: It is Apache licensed.
> >       No need to maintain a separate cluster as instances can run in
> >       executor JVMs.
> >       Easy to implement and store/retrieve dictionary values.
> >       It is a pure Java implementation.
> >       There is no master/slave concept and no single point of failure.
> >
> > Cons: Performance needs to be evaluated.
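> >
> > A minimal sketch against the Hazelcast 3.x API (map and counter names
> > are illustrative):
> >
> > import com.hazelcast.core.Hazelcast;
> > import com.hazelcast.core.HazelcastInstance;
> > import com.hazelcast.core.IAtomicLong;
> > import com.hazelcast.core.IMap;
> >
> > public class HazelcastDictionaryClient {
> >   private final IMap<String, Long> dictionary; // column value -> surrogate key
> >   private final IAtomicLong nextKey;           // cluster-wide key generator
> >
> >   public HazelcastDictionaryClient(HazelcastInstance hz, String columnName) {
> >     this.dictionary = hz.getMap("dict_" + columnName);
> >     this.nextKey = hz.getAtomicLong("dict_counter_" + columnName);
> >   }
> >
> >   public long getOrGenerate(String columnValue) {
> >     Long existing = dictionary.get(columnValue);
> >     if (existing != null) {
> >       return existing;
> >     }
> >     long candidate = nextKey.incrementAndGet();
> >     // putIfAbsent resolves races: the first writer wins, others reuse its key.
> >     Long previous = dictionary.putIfAbsent(columnValue, candidate);
> >     return previous != null ? previous : candidate;
> >   }
> >
> >   public static void main(String[] args) {
> >     // Such an instance would be embedded in each executor JVM; started
> >     // standalone here only to show usage.
> >     HazelcastInstance hz = Hazelcast.newHazelcastInstance();
> >     HazelcastDictionaryClient client = new HazelcastDictionaryClient(hz, "city");
> >     System.out.println(client.getOrGenerate("bangalore"));
> >     hz.shutdown();
> >   }
> > }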
> >
> > *Redis distributed map : *
> > It is also an in-memory map, but it is coded in the C language, so we
> > would need Java client libraries to interact with Redis. Need to
> > maintain a separate cluster for it. It can also partition the data (a
> > sketch follows the pros/cons below).
> > Pros : More feature-rich than Hazelcast.
> >        Easy to implement and store/retrieve dictionary values.
> > Cons : Need to maintain a separate cluster for maintaining the global
> >        dictionary.
> >        May not be suitable for the big data stack.
> >        It is BSD licensed (not sure whether we can use it or not).
> >        Online performance figures say it is a little slower than
> >        Hazelcast.
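> >
> > A minimal sketch with the Jedis Java client (key names are illustrative):
> >
> > import redis.clients.jedis.Jedis;
> >
> > public class RedisDictionaryClient {
> >   private final Jedis jedis;
> >   private final String dictKey;    // redis hash: column value -> surrogate key
> >   private final String counterKey; // redis counter for new surrogate keys
> >
> >   public RedisDictionaryClient(Jedis jedis, String columnName) {
> >     this.jedis = jedis;
> >     this.dictKey = "dict:" + columnName;
> >     this.counterKey = "dict:counter:" + columnName;
> >   }
> >
> >   public long getOrGenerate(String columnValue) {
> >     String existing = jedis.hget(dictKey, columnValue);
> >     if (existing != null) {
> >       return Long.parseLong(existing);
> >     }
> >     long candidate = jedis.incr(counterKey);
> >     // HSETNX writes only if the field is still absent; a loser rereads
> >     // the winner's key.
> >     if (jedis.hsetnx(dictKey, columnValue, String.valueOf(candidate)) == 1L) {
> >       return candidate;
> >     }
> >     return Long.parseLong(jedis.hget(dictKey, columnValue));
> >   }
> > }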
> >
> > Please let me know which would be the best fit for our loading solution,
> > and please add any other suitable solutions that I may have missed.
> > --
> > Thanks & Regards,
> > Ravi
> >
>