http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p2023.html
Hazelcast keeps the map in memory, and if a node goes down it can restore data from another partition of the cluster if available, which ensures availability. The backup is done through sync/async mode, and the Hazelcast map also supports locks to ensure data consistency. Anyway, we can give it a try and evaluate.
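The lookup-or-assign pattern this thread keeps returning to can be sketched in plain Java. A `ConcurrentHashMap` stands in for Hazelcast's distributed `IMap` here (class and method names are illustrative, not CarbonData code); the point is that surrogate-key assignment must be atomic so concurrent loaders never hand out two keys for one value:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Single-JVM stand-in for the distributed dictionary map discussed above.
// A Hazelcast IMap could take the place of the ConcurrentMap: it follows the
// same ConcurrentMap contract across JVMs and adds per-key lock()/unlock()
// plus the sync/async backups mentioned in the thread.
public class DictionaryMap {
    private final ConcurrentMap<String, Integer> dict = new ConcurrentHashMap<>();
    private final AtomicInteger nextKey = new AtomicInteger(0);

    // Returns the surrogate key for a value, atomically assigning a new one
    // if the value has not been seen before.
    public int getOrAssign(String value) {
        return dict.computeIfAbsent(value, v -> nextKey.getAndIncrement());
    }

    public static void main(String[] args) {
        DictionaryMap dm = new DictionaryMap();
        System.out.println(dm.getOrAssign("apple"));   // new value
        System.out.println(dm.getOrAssign("banana"));  // new value
        System.out.println(dm.getOrAssign("apple"));   // repeated value, same key
    }
}
```

In a real distributed setting the atomicity would come from the map implementation itself (e.g. Hazelcast's per-partition operation ordering), not from a JVM-local counter.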
> Hi Ravi,
>
> I took a quick look at Hazelcast; what they offer is a distributed map
> across the cluster (any single node stores only a portion of the map). To
> facilitate parallel data loading I think we need a complete copy on each
> node; is this the structure we are looking for?
>
> It does allow in-memory map backup in case one node goes down. To ensure
> persistence, they allow storing the map to a DB, but that requires implementing
> their API to hook it up. Async/sync modes are supported with no
> guarantee of consistency unless you go further for transaction
> support; 2-phase commit/XA are offered with read-committed isolation, and
> achieving that is quite complicated when we need to ensure ACID on changes to
> the map. I suggest you investigate further to understand the implications
> and effort.
>
> We all understand we cannot afford any inconsistency in the dictionary; that
> would mean we could not decode the data back correctly. Correctness is even more
> critical than performance.
>
>
> Jihong
>
> -----Original Message-----
> From: Ravindra Pesala [mailto:[hidden email]]
> Sent: Saturday, October 15, 2016 12:50 AM
> To: dev
> Subject: Re: Discussion(New feature) regarding single pass data loading
> solution.
>
> Hi Jacky/Jihong,
>
> I agree that new dictionary values are rare in the case of incremental data
> loads, but that completely depends on the user's data. In some
> scenarios there may be many new dictionary values; we cannot rule that out.
> Also, for users' convenience we should provide a single-pass solution without
> insisting that they run an external tool first. We can still provide the option to
> run the external tool first and supply a dictionary to improve performance.
>
> In my opinion it is better to use a purpose-built distributed map like
> Hazelcast than ZooKeeper + HDFS. It is lightweight and does not require a
> separate cluster; it can form the cluster within the executor JVMs.
> Maybe we can give it a try; after all, it will be just one interface
> implementation for dictionary generation. We can have multiple
> implementations and then decide based on performance.
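A minimal sketch of what that single dictionary-generation interface might look like (all names here are hypothetical, not CarbonData's actual API); a Hazelcast-, ZooKeeper + HDFS-, or HBase-backed class would each implement the same contract, with a trivial in-memory version shown as one possible implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pluggable interface for dictionary generation.
interface DictionaryGenerator {
    // Returns the surrogate key for a column value, generating one if new.
    int getOrGenerateKey(String columnName, String value);
}

// Trivial in-memory implementation, usable for tests or single-node loads.
// Distributed implementations would swap the maps for their own backends.
class LocalDictionaryGenerator implements DictionaryGenerator {
    private final Map<String, Map<String, Integer>> dicts = new HashMap<>();

    @Override
    public synchronized int getOrGenerateKey(String columnName, String value) {
        Map<String, Integer> dict =
            dicts.computeIfAbsent(columnName, c -> new HashMap<>());
        Integer key = dict.get(value);
        if (key == null) {
            key = dict.size();   // next surrogate key for this column
            dict.put(value, key);
        }
        return key;
    }

    public static void main(String[] args) {
        DictionaryGenerator g = new LocalDictionaryGenerator();
        System.out.println(g.getOrGenerateKey("city", "shenzhen")); // first value
    }
}
```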
>
> Regards,
> Ravi
>
> On 15 October 2016 at 10:50, Jacky Li <[hidden email]> wrote:
>
> > Hi,
> >
> > I can offer one more approach for this discussion. Since new dictionary
> > values are rare in the case of incremental loads (ensure the first load has as
> > many dictionary values as possible), synchronization should be rare. So
> > how about using ZooKeeper + HDFS files to provide this service? This is what
> > carbon is doing today; we can wrap ZooKeeper + HDFS to provide the global
> > dictionary interface.
> > It has the benefits of:
> > 1. automation: without bothering the user
> > 2. not introducing more dependencies: we are already using ZooKeeper and HDFS
> > 3. performance? since new dictionary values and synchronization are rare
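A sketch of the pattern behind this suggestion, with stand-ins for illustration only: a `ReentrantLock` plays the role of the ZooKeeper distributed lock and an in-memory map plays the role of the dictionary file on HDFS. The double-check keeps lock acquisition to genuinely new values, which is why rare new values make the approach cheap:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Stand-in sketch of the ZooKeeper + HDFS idea: globalLock models the
// ZooKeeper lock, dict models the dictionary file on HDFS plus its cache.
class ZkHdfsStyleDictionary {
    private final Lock globalLock = new ReentrantLock();
    private final Map<String, Integer> dict = new ConcurrentHashMap<>();

    int getOrGenerateKey(String value) {
        Integer key = dict.get(value);      // common case: hit the local cache
        if (key != null) {
            return key;                     // no synchronization needed
        }
        globalLock.lock();                  // rare case: a brand-new value
        try {
            // re-check after acquiring the lock: another loader may have
            // already appended this value while we were waiting
            key = dict.get(value);
            if (key == null) {
                key = dict.size();
                dict.put(value, key);       // in reality: append to the HDFS file
            }
            return key;
        } finally {
            globalLock.unlock();
        }
    }

    public static void main(String[] args) {
        ZkHdfsStyleDictionary d = new ZkHdfsStyleDictionary();
        System.out.println(d.getOrGenerateKey("shenzhen"));
        System.out.println(d.getOrGenerateKey("bangalore"));
        System.out.println(d.getOrGenerateKey("shenzhen"));
    }
}
```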
> >
> > What do you think?
> >
> > Regards,
> > Jacky
> >
> > > On 15 October 2016, at 02:38, Jihong Ma <[hidden email]> wrote:
> > >
> > > Hi Ravi,
> > >
> > > The major concern I have with generating the global dictionary from scratch
> > > in a single scan is performance; handling an occasional update
> > > to the dictionary is far simpler and more cost-effective in terms of
> > > synchronization cost and refreshing the global/local cache copy.
> > >
> > > There is a lot to worry about with a distributed map, and leveraging a KV
> > > store is overkill if it is just for dictionary generation.
> > >
> > > Regards,
> > >
> > > Jihong
> > >
> > > -----Original Message-----
> > > From: Ravindra Pesala [mailto:[hidden email]]
> > > Sent: Friday, October 14, 2016 11:03 AM
> > > To: dev
> > > Subject: Re: Discussion(New feature) regarding single pass data loading solution.
> > >
> > > Hi Jihong,
> > >
> > > I agree, we can use an external tool for the first load, but for incremental
> > > loads we should have a solution to add to the global dictionary. So this
> > > solution should be enough to generate the global dictionary even if the
> > > user does not use the external tool the first time. That solution could be
> > > a distributed map or a KV store.
> > >
> > > Regards,
> > > Ravi.
> > >
> > > On 14 October 2016 at 23:12, Jihong Ma <[hidden email]> wrote:
> > >
> > >> Hi Liang,
> > >>
> > >> This tool is more or less like the first load. After the
> > >> table is created, any subsequent loads/incremental loads will proceed and are
> > >> capable of updating the global dictionary when they encounter new values;
> > >> this is the easiest way of achieving a 1-pass data loading process without
> > >> too much overhead.
> > >>
> > >> Since this tool is only triggered once per table, it is not too much of a
> > >> burden on the end users. Taking global dictionary generation out of the way
> > >> of regular data loading is the key here.
> > >>
> > >> Jihong
> > >>
> > >> -----Original Message-----
> > >> From: Liang Chen [mailto:[hidden email]]
> > >> Sent: Thursday, October 13, 2016 5:39 PM
> > >> To: [hidden email]
> > >> Subject: RE: Discussion(New feature) regarding single pass data loading solution.
> > >>
> > >> Hi Jihong,
> > >>
> > >> I am not sure users will accept using an extra tool to do this work,
> > >> because providing a tool or doing a scan once per table to build most of
> > >> the global dict has the same cost from the user's perspective, and
> > >> maintaining the dict file has the same cost too. They always expect that
> > >> the system can automatically and internally generate the dict file while
> > >> loading data.
> > >>
> > >> Can we consider this:
> > >> first load: do a scan to generate most of the global dict file, then copy
> > >> this file to each load node for subsequent loading.
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >>
> > >> Jihong Ma wrote
> > >>>>>>> the question is what would be the default implementation? Load
> > >>>>>>> data without dictionary?
> > >>>
> > >>> My thought is we can provide a tool to generate the global dictionary
> > >>> using a sample data set, so the initial global dictionaries are available
> > >>> before normal data loading. We should then be able to perform encoding
> > >>> based on that; we only need to handle occasionally adding entries while
> > >>> loading. For columns specified with global dictionary encoding whose
> > >>> dictionary is not in place before data loading, we error out and direct
> > >>> the user to use the tool first.
> > >>>
> > >>> Make sense?
> > >>>
> > >>> Jihong
> > >>>
> > >>> -----Original Message-----
> > >>> From: Ravindra Pesala [mailto:ravi.pesala@]
> > >>> Sent: Thursday, October 13, 2016 1:12 AM
> > >>> To: dev
> > >>> Subject: Re: Discussion(New feature) regarding single pass data loading solution.
> > >>>
> > >>> Hi Jihong/Aniket,
> > >>>
> > >>> In the current implementation of carbondata we already handle an
> > >>> external dictionary while loading the data.
> > >>> But here the question is: what would be the default implementation?
> > >>> Load data without a dictionary?
> > >>>
> > >>>
> > >>> Regards,
> > >>> Ravi
> > >>>
> > >>> On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
> > >>>
> > >>>> Hi Ravi,
> > >>>>
> > >>>> 1. I agree with Jihong that creation of the global dictionary should be
> > >>>> optional, so that it can be disabled to improve load performance. Users
> > >>>> should be made aware that using a global dictionary may boost query
> > >>>> performance.
> > >>>> 2. We should have a generic interface to manage the global dictionary
> > >>>> when it comes from external sources. In general, it is not a good idea
> > >>>> to depend on too many external tools.
> > >>>> 3. Maybe we should allow the user to generate the global dictionary
> > >>>> separately through a SQL command or similar, something like a
> > >>>> materialized view. This means carbon should avoid using a local
> > >>>> dictionary and do late materialization when a global dictionary is
> > >>>> present.
> > >>>> 4. Maybe we should think of ways to create the global dictionary lazily
> > >>>> as we serve SELECT queries. The implementation may not be that
> > >>>> straightforward; not sure if it is worth the effort.
> > >>>>
> > >>>> Best Regards,
> > >>>> Aniket
> > >>>>
> > >>>>
> > >>>> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
> > >>>>
> > >>>>>
> > >>>>> A rather straightforward option is to allow the user to supply a
> > >>>>> global dictionary generated somewhere else, or we build a separate
> > >>>>> tool just for generating as well as updating the dictionary. Then the
> > >>>>> normal data loading process will encode columns with a local
> > >>>>> dictionary if none is supplied. This should cover the majority of
> > >>>>> cases for low-to-medium cardinality columns. For the cases where we
> > >>>>> have to incorporate online dictionary updates, a lock mechanism to
> > >>>>> sync up should serve the purpose.
> > >>>>>
> > >>>>> In other words, generating the global dictionary is an optional step,
> > >>>>> only triggered when needed, not a default step as it is currently.
> > >>>>>
> > >>>>> Jihong
> > >>>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Ravindra Pesala [mailto:ravi.pesala@]
> > >>>>> Sent: Tuesday, October 11, 2016 2:33 AM
> > >>>>> To: dev
> > >>>>> Subject: Discussion(New feature) regarding single pass data loading
> > >>>>> solution.
> > >>>>>
> > >>>>> Hi All,
> > >>>>>
> > >>>>> This discussion is regarding single pass data load solution.
> > >>>>>
> > >>>>> Currently data is loaded into carbon in 2 passes/jobs:
> > >>>>> 1. Generate the global dictionary using a spark job.
> > >>>>> 2. Encode the data with dictionary values and create carbondata files.
> > >>>>> This 2-pass solution has many disadvantages: it needs to read the data
> > >>>>> twice in the case of csv input, or it needs to execute the dataframe
> > >>>>> twice if data is loaded from a dataframe.
> > >>>>>
> > >>>>> In order to overcome the above issues of 2-pass data loading, we can
> > >>>>> have single-pass data loading; the following are the alternative
> > >>>>> solutions.
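For illustration, the 2-pass flow described above in miniature (hypothetical demo code, not CarbonData's implementation): pass 1 scans the input to build the dictionary, pass 2 scans it again to encode, which is exactly the double read a single-pass design would remove:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy version of the current 2-pass load for a single column.
class TwoPassLoadDemo {
    static int[] load(List<String> column, Map<String, Integer> dict) {
        // pass 1: build the global dictionary (one full read of the input)
        for (String v : column) {
            dict.putIfAbsent(v, dict.size());
        }
        // pass 2: encode the same input with the completed dictionary
        int[] encoded = new int[column.size()];
        for (int i = 0; i < column.size(); i++) {
            encoded[i] = dict.get(column.get(i));
        }
        return encoded;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        System.out.println(Arrays.toString(load(Arrays.asList("a", "b", "a"), dict)));
    }
}
```

A single-pass design would collapse the two loops into one, assigning dictionary values on the fly, which is where the distributed-map options below come in.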
> > >>>>>
> > >>>>> Use a local dictionary
> > >>>>> Use a local dictionary for each carbondata file while loading data,
> > >>>>> but it may lead to query performance degradation and a larger memory
> > >>>>> footprint.
> > >>>>>
> > >>>>> Use a KV store/distributed map.
> > >>>>> *HBase/Cassandra cluster: *
> > >>>>> Dictionary data would be stored in the KV store, which generates the
> > >>>>> dictionary value if it is not already present. We all know the
> > >>>>> pros/cons of HBase, but the following are a few.
> > >>>>> Pros: These are Apache licensed.
> > >>>>>       Easy to implement to store/retrieve dictionary values.
> > >>>>>       Performance needs to be evaluated.
> > >>>>>
> > >>>>> Cons: Need to maintain a separate cluster for the global
> > >>>>> dictionary.
> > >>>>>
> > >>>>> *Hazelcast distributed map: *
> > >>>>> Dictionary data could be saved in the distributed concurrent hash map
> > >>>>> of Hazelcast. It is an in-memory map, partitioned across the nodes.
> > >>>>> We can even maintain backups using its sync/async functionality to
> > >>>>> avoid data loss when an instance goes down. We do not need a separate
> > >>>>> cluster for it, as it can run in the executor JVMs themselves.
> > >>>>> Pros: It is Apache licensed.
> > >>>>>       No need to maintain a separate cluster, as instances can run in
> > >>>>> the executor JVMs.
> > >>>>>       Easy to implement and store/retrieve dictionary values.
> > >>>>>       It is a pure Java implementation.
> > >>>>>       There is no master/slave concept and no single point of failure.
> > >>>>>
> > >>>>> Cons: Performance needs to be evaluated.
> > >>>>>
> > >>>>> *Redis distributed map: *
> > >>>>> It is also an in-memory map, but it is written in C, so we would need
> > >>>>> Java client libraries to interact with Redis. It requires a separate
> > >>>>> cluster, and it also can partition the data.
> > >>>>> Pros: More feature-rich than Hazelcast.
> > >>>>>       Easy to implement and store/retrieve dictionary values.
> > >>>>> Cons: Need to maintain a separate cluster for the global
> > >>>>> dictionary.
> > >>>>>       May not be suitable for the big data stack.
> > >>>>>       It is BSD licensed (not sure whether we can use it or not).
> > >>>>>       Online performance figures say it is a little slower than
> > >>>>> Hazelcast.
> > >>>>>
> > >>>>> Please let me know which would best fit our loading solution. And
> > >>>>> please add any other suitable solution if I have missed one.
> > >>>>> --
> > >>>>> Thanks & Regards,
> > >>>>> Ravi
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Thanks & Regards,
> > >>> Ravi
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1887.html
> > >> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Ravi
> >
> >
> >
> >
>
>
> --
> Thanks & Regards,
> Ravi
>