http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1965.html
load but that is completely depends on user data scenarios. In some
out insisting them to run external tool first. We can provide the option to
Hazelcast than Zookeeper + HDFS. It is lightweight and does not require to
implementation for dictionary generation. We can have multiple
> Hi,
>
> I can offer one more approach for this discussion. New dictionary
> values are rare in incremental loads (provided the first load captures as
> many dictionary values as possible), so synchronization should be rare too.
> So how about using Zookeeper + an HDFS file to provide this service? This is
> what carbon is doing today; we can wrap Zookeeper + HDFS to provide the global
> dictionary interface.
> It has these benefits:
> 1. automated: it does not bother the user
> 2. no new dependencies: we are already using Zookeeper and HDFS.
> 3. performance? New dictionary values and synchronization are rare.
>
> What do you think?
>
> Regards,
> Jacky
>
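The idea above (a shared dictionary that loaders rarely need to synchronize on) can be sketched as follows. This is only an illustration under stated assumptions: `SharedDictionary` and `Loader` are hypothetical names, a `threading.Lock` stands in for a Zookeeper lock, and an in-memory dict stands in for the dictionary file on HDFS. It is not how CarbonData implements it.

```python
import threading
from typing import Dict


class SharedDictionary:
    """Stand-in for the dictionary file on HDFS guarded by a Zookeeper lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, int] = {}
        self.sync_count = 0  # number of times a caller took the lock to look up or add

    def get_or_assign(self, value: str) -> int:
        # Every call here pays the synchronization cost.
        with self._lock:
            self.sync_count += 1
            if value not in self._entries:
                self._entries[value] = len(self._entries) + 1
            return self._entries[value]

    def snapshot(self) -> Dict[str, int]:
        # Copy of the current dictionary, used to seed a loader's local cache.
        with self._lock:
            return dict(self._entries)


class Loader:
    """A load task that keeps a local cache and syncs only on a cache miss."""

    def __init__(self, shared: SharedDictionary) -> None:
        self._shared = shared
        self._cache = shared.snapshot()

    def encode(self, value: str) -> int:
        key = self._cache.get(value)
        if key is None:  # rare path: value not seen by this loader yet
            key = self._shared.get_or_assign(value)
            self._cache[value] = key
        return key
```

If the first load already captured most values, an incremental load mostly hits the local cache, and the lock is taken only for the occasional new value.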
> > On October 15, 2016, at 2:38 AM, Jihong Ma <[hidden email]> wrote:
> >
> > Hi Ravi,
> >
> > The major concern I have with generating the global dictionary from scratch
> > in a single scan is performance. Handling an occasional update to the
> > dictionary is far simpler and more cost effective in terms of
> > synchronization cost and refreshing the global/local cache copy.
> >
> > There is a lot to worry about with a distributed map, and leveraging a KV
> > store is overkill if it is just for dictionary generation.
> >
> > Regards.
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Ravindra Pesala [mailto:[hidden email]]
> > Sent: Friday, October 14, 2016 11:03 AM
> > To: dev
> > Subject: Re: Discussion(New feature) regarding single pass data loading
> > solution.
> >
> > Hi Jihong,
> >
> > I agree, we can use an external tool for the first load, but for incremental
> > loads we should have a solution for adding to the global dictionary. This
> > solution should be enough to generate the global dictionary even if the user
> > does not use the external tool the first time. That solution could be a
> > distributed map or a KV store.
> >
> > Regards,
> > Ravi.
> >
> > On 14 October 2016 at 23:12, Jihong Ma <[hidden email]> wrote:
> >
> >> Hi Liang,
> >>
> >> This tool is more or less like the first load, run once after the table
> >> is created. Any subsequent loads/incremental loads will proceed and are
> >> capable of updating the global dictionary when they encounter a new value.
> >> This is the easiest way of achieving a 1 pass data loading process without
> >> too much overhead.
> >>
> >> Since this tool is only triggered once per table, it should not be too much
> >> of a burden on end users. Keeping global dictionary generation out of the
> >> way of regular data loading is the key here.
> >>
> >> Jihong
> >>
> >> -----Original Message-----
> >> From: Liang Chen [mailto:[hidden email]]
> >> Sent: Thursday, October 13, 2016 5:39 PM
> >> To: [hidden email]
> >> Subject: RE: Discussion(New feature) regarding single pass data loading
> >> solution.
> >>
> >> Hi Jihong,
> >>
> >> I am not sure that users will accept using an extra tool to do this work.
> >> From the user's perspective, providing a tool and doing a scan the first
> >> time per table to build most of the global dict cost the same, and
> >> maintaining the dict file also costs the same. Users always expect the
> >> system to generate the dict file automatically and internally while
> >> loading data.
> >>
> >> Can we consider this:
> >> first load: scan to generate most of the global dict file, then copy this
> >> file to each load node for subsequent loading
> >>
> >> Regards
> >> Liang
> >>
> >>
> >> Jihong Ma wrote
> >>>>>>> the question is what would be the default implementation? Load data
> >>>>>>> without dictionary?
> >>>
> >>> My thought is we can provide a tool to generate the global dictionary from
> >>> a sample data set, so the initial global dictionaries are available before
> >>> normal data loading. We can then perform encoding based on them, and
> >>> only need to handle occasionally adding entries while loading. For
> >>> columns specified with global dictionary encoding whose dictionary is not
> >>> in place before data loading, we error out and direct the user to run the
> >>> tool first.
> >>>
> >>> Make sense?
> >>>
> >>> Jihong
> >>>
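A minimal sketch of the flow described in the message above, assuming a hypothetical offline tool (`build_dictionary`) and a hypothetical loader entry point (`load_column`); neither name comes from CarbonData, and the key assignment scheme is simplified for illustration.

```python
from typing import Dict, Iterable, List, Optional


class MissingDictionaryError(Exception):
    """Raised when a dictionary-encoded column has no dictionary in place."""


def build_dictionary(sample: Iterable[str]) -> Dict[str, int]:
    """Hypothetical offline tool: assign keys to the distinct sample values."""
    return {value: key for key, value in enumerate(sorted(set(sample)), start=1)}


def load_column(rows: Iterable[str], dictionary: Optional[Dict[str, int]]) -> List[int]:
    """Encode a column, failing fast if the dictionary was never generated."""
    if dictionary is None:
        raise MissingDictionaryError(
            "no global dictionary found; run the dictionary tool first"
        )
    encoded = []
    for value in rows:
        if value not in dictionary:  # occasional new entry found during loading
            dictionary[value] = len(dictionary) + 1
        encoded.append(dictionary[value])
    return encoded
```

The tool runs once per table on a sample; regular loads then only add the occasional value that the sample missed.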
> >>> -----Original Message-----
> >>> From: Ravindra Pesala [mailto:ravi.pesala@]
> >>> Sent: Thursday, October 13, 2016 1:12 AM
> >>> To: dev
> >>> Subject: Re: Discussion(New feature) regarding single pass data loading
> >>> solution.
> >>>
> >>> Hi Jihong/Aniket,
> >>>
> >>> In the current implementation of carbondata we already handle an
> >>> external dictionary while loading the data.
> >>> But here the question is what would be the default implementation? Load
> >>> data without dictionary?
> >>>
> >>>
> >>> Regards,
> >>> Ravi
> >>>
> >>> On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
> >>>
> >>>> Hi Ravi,
> >>>>
> >>>> 1. I agree with Jihong that creation of the global dictionary should be
> >>>> optional, so that it can be disabled to improve load performance. The
> >>>> user should be made aware that using a global dictionary may boost query
> >>>> performance.
> >>>> 2. We should have a generic interface to manage the global dictionary when
> >>>> it comes from external sources. In general, it is not a good idea to
> >>>> depend on too many external tools.
> >>>> 3. Maybe we should allow the user to generate the global dictionary
> >>>> separately through a SQL command or similar, something like a materialized
> >>>> view. This means carbon should avoid using the local dictionary and do late
> >>>> materialization when a global dictionary is present.
> >>>> 4. Maybe we should think of some ways to create the global dictionary
> >>>> lazily as we serve SELECT queries. The implementation may not be that
> >>>> straightforward. Not sure if it's worth the effort.
> >>>>
> >>>> Best Regards,
> >>>> Aniket
> >>>>
> >>>>
> >>>> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
> >>>>
> >>>>>
> >>>>> A rather straightforward option is to allow the user to supply a global
> >>>>> dictionary generated somewhere else, or we build a separate tool just for
> >>>>> generating as well as updating the dictionary. Then the normal data loading
> >>>>> process will encode columns with a local dictionary if none is supplied.
> >>>>> This should cover the majority of cases for low-to-medium cardinality
> >>>>> columns. For the cases where we have to incorporate online dictionary
> >>>>> updates, using a lock mechanism to sync up should serve the purpose.
> >>>>>
> >>>>> In other words, generating the global dictionary is an optional step,
> >>>>> only triggered when needed, not a default step as we do currently.
> >>>>>
> >>>>> Jihong
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Ravindra Pesala [mailto:ravi.pesala@]
> >>>>> Sent: Tuesday, October 11, 2016 2:33 AM
> >>>>> To: dev
> >>>>> Subject: Discussion(New feature) regarding single pass data loading
> >>>>> solution.
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This discussion is regarding a single pass data load solution.
> >>>>>
> >>>>> Currently data is loaded into carbon in 2 passes/jobs:
> >>>>> 1. Generate the global dictionary using a spark job.
> >>>>> 2. Encode the data with dictionary values and create carbondata files.
> >>>>> This 2 pass solution has many disadvantages: it needs to read the data
> >>>>> twice in case of csv file input, or it needs to execute the dataframe
> >>>>> twice if data is loaded from a dataframe.
> >>>>>
> >>>>> To overcome the above issues of 2 pass data loading, we can have
> >>>>> single pass data loading. The following are the alternative solutions.
> >>>>>
> >>>>> Use local dictionary
> >>>>> Use a local dictionary for each carbondata file while loading data, but
> >>>>> it may lead to query performance degradation and a larger memory footprint.
> >>>>>
> >>>>> Use KV store/distributed map.
> >>>>> *HBase/Cassandra cluster : *
> >>>>> Dictionary data would be stored in the KV store, which generates the
> >>>>> dictionary value if it is not already present. We all know the pros/cons
> >>>>> of HBase, but the following are a few.
> >>>>> Pros : These are Apache licensed.
> >>>>>        Easy to implement storing/retrieving dictionary values.
> >>>>>        Performance needs to be evaluated.
> >>>>>
> >>>>> Cons : Need to maintain a separate cluster for maintaining the global
> >>>>>        dictionary.
> >>>>>
> >>>>> *Hazelcast distributed map : *
> >>>>> Dictionary data could be saved in the distributed concurrent hash map of
> >>>>> Hazelcast. It is an in-memory map, partitioned according to the number of
> >>>>> nodes. We can even maintain backups using the sync/async functionality to
> >>>>> avoid data loss when an instance is down. We do not need to maintain a
> >>>>> separate cluster for it, as it can run in the executor JVM itself.
> >>>>> Pros: It is Apache licensed.
> >>>>>       No need to maintain a separate cluster, as instances can run in
> >>>>>       executor JVMs.
> >>>>>       Easy to implement storing/retrieving dictionary values.
> >>>>>       It is a pure Java implementation.
> >>>>>       There is no master/slave concept and no single point of failure.
> >>>>>
> >>>>> Cons: Performance needs to be evaluated.
> >>>>>
> >>>>> *Redis distributed map : *
> >>>>> It is also an in-memory map, but it is written in C, so we would need
> >>>>> Java client libraries to interact with Redis. Need to maintain a
> >>>>> separate cluster for it. It can also partition the data.
> >>>>> Pros : More feature rich than Hazelcast.
> >>>>>        Easy to implement storing/retrieving dictionary values.
> >>>>> Cons : Need to maintain a separate cluster for maintaining the global
> >>>>>        dictionary.
> >>>>>        May not be suitable for the big data stack.
> >>>>>        It is BSD licensed (not sure whether we can use it or not).
> >>>>>        Online performance figures say it is a little slower than Hazelcast.
> >>>>>
> >>>>> Please let me know which would be the best fit for our loading solution,
> >>>>> and please add any other suitable solution if I missed one.
> >>>>> --
> >>>>> Thanks & Regards,
> >>>>> Ravi
> >>>>>
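To make the difference between the two approaches in the original proposal concrete, here is a simplified, single-process sketch. The function names are hypothetical, and the plain dict passed to `single_pass_encode` stands in for whichever shared store (KV store or distributed map) would hold the global dictionary; real key assignment in CarbonData may differ.

```python
from typing import Dict, Iterable, List


def two_pass_encode(rows: Iterable[str]) -> List[int]:
    """Pass 1 builds the dictionary, pass 2 encodes: the input is read twice."""
    rows = list(rows)  # stands in for re-reading the CSV or re-running the dataframe
    dictionary: Dict[str, int] = {}
    for value in rows:  # pass 1: dictionary generation
        dictionary.setdefault(value, len(dictionary) + 1)
    return [dictionary[value] for value in rows]  # pass 2: encoding


def single_pass_encode(rows: Iterable[str], dictionary: Dict[str, int]) -> List[int]:
    """Encode while reading, assigning a new key the first time a value is seen."""
    encoded = []
    for value in rows:
        key = dictionary.get(value)
        if key is None:
            key = len(dictionary) + 1
            dictionary[value] = key
        encoded.append(key)
    return encoded
```

The single pass variant consumes the input only once, which is why it also works on a one-shot iterator; the cost moves to the lookup-or-insert against the shared dictionary.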
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks & Regards,
> >>> Ravi
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1887.html
> >> Sent from the Apache CarbonData Mailing List archive mailing list archive
> >> at Nabble.com.
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>
>
>
>