http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1889.html
be a viable option. Only future loads can benefit from it. But then will
dictionary. Maybe we should not consider this option.
> I have the following comments:
>
> 1. If an external dictionary is provided, we accept it. This interface should
> be generic enough that we can perform lookup, add, delete, create and
> drop operations. I believe we already have this functionality to some
> extent. As long as we are able to maintain the dictionary, it should be fine.
> 2. If an external dictionary is not provided, then by default we should build
> it internally, which is our current behavior. This will continue to impact
> load performance, though.
> 3. If load performance is not acceptable, we should allow the user to disable
> building of the global dictionary. Carbon should build a local dictionary
> instead. Will this setting apply to all subsequent loads? Maybe yes, for
> now.
> 4. If the user decides to build the dictionary at a later point, either via an
> external tool or using a Carbon SQL command ("CREATE DICTIONARY TABLE..."),
> we should provide that facility. This will help the user improve query
> performance through late materialization. The local dictionary will not be
> used in this case. Subsequent loads will continue to add new entries to this
> new dictionary (external or Carbon-specific).
>
> This doesn't really solve our double-pass problem, but it works around it
> by moving the dictionary-building operation out of the critical path.
>
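As a rough illustration of points 1 to 3 above, here is a minimal Python sketch of a generic dictionary interface (lookup, add, delete, create, drop) plus the fallback order between external, global and local dictionaries. All names here (`DictionaryStore`, `DictionaryCatalog`, `choose_dictionary_mode`) are hypothetical and are not CarbonData APIs; the in-memory storage and the choice to start surrogate keys at 1 are assumptions made only for the example.

```python
from typing import Dict, Optional


class DictionaryStore:
    """Hypothetical in-memory dictionary for a single column."""

    def __init__(self) -> None:
        self._codes: Dict[str, int] = {}
        self._next_code = 1  # assumption: surrogate keys start at 1

    def lookup(self, value: str) -> Optional[int]:
        """Return the surrogate key for value, or None if it is unknown."""
        return self._codes.get(value)

    def add(self, value: str) -> int:
        """Add value if missing and return its surrogate key (idempotent)."""
        if value not in self._codes:
            self._codes[value] = self._next_code
            self._next_code += 1  # keys of deleted entries are never reused
        return self._codes[value]

    def delete(self, value: str) -> bool:
        """Remove value; return True if it was present."""
        return self._codes.pop(value, None) is not None


class DictionaryCatalog:
    """Hypothetical registry that covers the create and drop operations."""

    def __init__(self) -> None:
        self._stores: Dict[str, DictionaryStore] = {}

    def create(self, column: str) -> DictionaryStore:
        """Create the dictionary for column, or return the existing one."""
        return self._stores.setdefault(column, DictionaryStore())

    def drop(self, column: str) -> bool:
        """Drop the dictionary for column; return True if it existed."""
        return self._stores.pop(column, None) is not None

    def get(self, column: str) -> Optional[DictionaryStore]:
        return self._stores.get(column)


def choose_dictionary_mode(external_provided: bool, global_enabled: bool) -> str:
    """Fallback order from points 1-3: external, then global, then local."""
    if external_provided:
        return "external"
    if global_enabled:
        return "global"
    return "local"
```

Whether create and drop belong on the same object as lookup, add and delete is a design choice; they are split here only to keep the per-column operations separate from catalog management.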
>
> Best Regards,
> Aniket
>
>
> On Thu, Oct 13, 2016 at 5:39 PM, Liang Chen <[hidden email]> wrote:
>
>> Hi Jihong,
>>
>> I am not sure users will accept using an extra tool to do this work. From
>> the user's perspective, providing a tool and doing a first-time scan per
>> table to build most of the global dict cost the same, and maintaining the
>> dict file costs the same too. Users always expect the system to generate
>> the dict file automatically and internally during data loading.
>>
>> Can we consider this:
>> First load: scan to generate most of the global dict file, then copy this
>> file to each load node for subsequent loads.
>>
>> Regards
>> Liang
>>
>>
>> Jihong Ma wrote
>> >>>>>the question is what would be the default implementation? Load data
>> without dictionary?
>> >
>> > My thought is we can provide a tool to generate the global dictionary using
>> > a sample data set, so the initial global dictionaries are available before
>> > normal data loading. We would be able to perform encoding based on that, and
>> > we only need to handle occasionally adding entries while loading. For
>> > columns specified with global dictionary encoding where the dictionary is
>> > not in place before data loading, we error out and direct the user to use
>> > the tool first.
>> >
>> > Make sense?
>> >
>> > Jihong
>> >
>> > -----Original Message-----
>> > From: Ravindra Pesala [mailto:ravi.pesala@]
>> > Sent: Thursday, October 13, 2016 1:12 AM
>> > To: dev
>> > Subject: Re: Discussion(New feature) regarding single pass data loading
>> > solution.
>> >
>> > Hi Jihong/Aniket,
>> >
>> > In the current implementation of CarbonData we already handle an
>> > external dictionary while loading the data.
>> > But the question here is: what would be the default implementation? Load
>> > data without a dictionary?
>> >
>> >
>> > Regards,
>> > Ravi
>> >
>> > On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
>> >
>> >> Hi Ravi,
>> >>
>> >> 1. I agree with Jihong that creation of the global dictionary should be
>> >> optional, so that it can be disabled to improve load performance. The
>> >> user should be made aware that using a global dictionary may boost query
>> >> performance.
>> >> 2. We should have a generic interface to manage the global dictionary when
>> >> it comes from external sources. In general, it is not a good idea to
>> >> depend on too many external tools.
>> >> 3. Maybe we should allow the user to generate the global dictionary
>> >> separately through a SQL command or similar, something like a materialized
>> >> view. This means Carbon should avoid using the local dictionary and do
>> >> late materialization when a global dictionary is present.
>> >> 4. Maybe we should think of ways to create the global dictionary lazily
>> >> as we serve SELECT queries. The implementation may not be that
>> >> straightforward. Not sure if it's worth the effort.
>> >>
>> >> Best Regards,
>> >> Aniket
>> >>
>> >>
>> >> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
>> >>
>> >> >
>> >> > A rather straightforward option is to allow the user to supply a global
>> >> > dictionary generated somewhere else, or we build a separate tool just
>> >> > for generating as well as updating the dictionary. Then the normal data
>> >> > loading process will encode columns with a local dictionary if a global
>> >> > one is not supplied. This should cover the majority of cases for
>> >> > low-to-medium cardinality columns. For the cases where we have to
>> >> > incorporate online dictionary updates, using a lock mechanism to sync
>> >> > up should serve the purpose.
>> >> >
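The lock-based online update mentioned above could look roughly like the following Python sketch. It is only an assumption about how such a mechanism might work within a single process: the class name `LockedGlobalDictionary` is hypothetical, and a real deployment would need a lock shared across load nodes rather than `threading.Lock`. Values already in the dictionary are encoded without taking the lock; only the occasional new entry is added under it.

```python
import threading
from typing import Dict, Iterable


class LockedGlobalDictionary:
    """Hypothetical global dictionary that takes a lock only for new entries."""

    def __init__(self, initial: Iterable[str] = ()) -> None:
        self._codes: Dict[str, int] = {}
        self._lock = threading.Lock()
        for value in initial:
            self.encode(value)

    def encode(self, value: str) -> int:
        # Fast path: most values are expected to exist already, so no lock
        # is taken. This relies on dict reads being safe next to a writer,
        # which holds in CPython.
        code = self._codes.get(value)
        if code is not None:
            return code
        with self._lock:
            # Re-check under the lock: another loader may have added the
            # same value between the lookup above and acquiring the lock.
            code = self._codes.get(value)
            if code is None:
                code = len(self._codes) + 1  # entries are never deleted here
                self._codes[value] = code
            return code

    def size(self) -> int:
        return len(self._codes)
```

Because entries are only ever appended, `len(self._codes) + 1` is a safe way to generate the next key in this sketch; a dictionary that supports deletion would need a separate counter.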
>> >> > In other words, generating the global dictionary is an optional step,
>> >> > only triggered when needed, not a default step as it is currently.
>> >> >
>> >> > Jihong
>> >> >
>> >> > -----Original Message-----
>> >> > From: Ravindra Pesala [mailto:ravi.pesala@]
>> >> > Sent: Tuesday, October 11, 2016 2:33 AM
>> >> > To: dev
>> >> > Subject: Discussion(New feature) regarding single pass data loading
>> >> > solution.
>> >> >
>> >> > Hi All,
>> >> >
>> >> > This discussion is regarding single pass data load solution.
>> >> >
>> >> > Currently data is loaded into Carbon in 2 passes/jobs:
>> >> > 1. Generate the global dictionary using a Spark job.
>> >> > 2. Encode the data with dictionary values and create carbondata files.
>> >> > This 2-pass solution has many disadvantages: it needs to read the data
>> >> > twice in the case of CSV file input, or it needs to execute the
>> >> > dataframe twice if data is loaded from a dataframe.
>> >> >
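To make the cost of the two passes concrete, here is a small Python sketch, not the actual Spark job. The function name `two_pass_load` is hypothetical, and assigning codes in sorted order starting at 1 is an assumption for the example. The source is passed in as a callable so that each full read is visible.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def two_pass_load(
    read_source: Callable[[], Iterable[str]],
) -> Tuple[Dict[str, int], List[int]]:
    """Load one column in two passes over the same source."""
    # Pass 1: scan the whole input to collect distinct values and build
    # the global dictionary.
    distinct = sorted(set(read_source()))
    dictionary = {value: code for code, value in enumerate(distinct, start=1)}

    # Pass 2: scan the whole input again and replace every value with
    # its dictionary code.
    encoded = [dictionary[value] for value in read_source()]
    return dictionary, encoded


# Usage: count how often the source is read.
reads = {"count": 0}


def read_column() -> Iterable[str]:
    reads["count"] += 1
    return iter(["b", "c", "b", "a"])


dictionary, encoded = two_pass_load(read_column)
print(reads["count"], dictionary, encoded)
```

The second call to `read_source()` is the part a single-pass design tries to remove: for CSV input it is a second file scan, and for a dataframe it is a second execution.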
>> >> > To overcome the above issues of 2-pass data loading, we can have
>> >> > single-pass data loading. The following are the alternative solutions.
>> >> >
>> >> > Use local dictionary
>> >> > Use a local dictionary for each carbondata file while loading data, but
>> >> > it may lead to query performance degradation and a larger memory
>> >> > footprint.
>> >> >
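One plausible way to picture the local-dictionary trade-off is the following Python sketch. The helper names are hypothetical and the first-seen code assignment is an assumption. Each file is encoded in a single pass with its own dictionary, so the same value can receive different codes in different files, and each file stores its own copy of the values.

```python
from typing import Dict, Iterable, List, Tuple


def build_local_dictionary(values: Iterable[str]) -> Tuple[Dict[str, int], List[int]]:
    """Encode one file in a single pass, assigning codes in first-seen order."""
    dictionary: Dict[str, int] = {}
    encoded: List[int] = []
    for value in values:
        # setdefault returns the existing code, or stores and returns a new one.
        code = dictionary.setdefault(value, len(dictionary) + 1)
        encoded.append(code)
    return dictionary, encoded


def decode(dictionary: Dict[str, int], encoded: Iterable[int]) -> List[str]:
    """Map codes back to values using the file's own dictionary."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[code] for code in encoded]


# Two files loaded independently, each read only once.
dict_a, enc_a = build_local_dictionary(["red", "blue", "red"])
dict_b, enc_b = build_local_dictionary(["blue", "green"])

# "blue" has a different code in each file, so codes cannot be compared
# across files without decoding them first.
print(dict_a["blue"], dict_b["blue"])
print(decode(dict_a, enc_a) + decode(dict_b, enc_b))
```

With a global dictionary the codes would be comparable across files, which is what makes late materialization possible; with local dictionaries a cross-file filter or group-by has to decode first.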
>> >> > Use KV store/distributed map.
>> >> > *HBase/Cassandra cluster : *
>> >> > Dictionary data would be stored in the KV store, which generates the
>> >> > dictionary value if it is not present. We all know the pros/cons of
>> >> > HBase, but the following are a few.
>> >> > Pros : These are Apache licensed.
>> >> > Easy to implement storing/retrieving dictionary values.
>> >> > Performance needs to be evaluated.
>> >> >
>> >> > Cons : Need to maintain a separate cluster for the global
>> >> > dictionary.
>> >> >
>> >> > *Hazelcast distributed map : *
>> >> > Dictionary data could be saved in Hazelcast's distributed concurrent
>> >> > hash map. It is an in-memory map, partitioned according to the number
>> >> > of nodes. We can even maintain backups using the sync/async
>> >> > functionality to avoid data loss when an instance is down. We do not
>> >> > need to maintain a separate cluster for it, as it can run on the
>> >> > executor JVM itself.
>> >> > Pros: It is Apache licensed.
>> >> > No need to maintain a separate cluster, as instances can run in
>> >> > executor JVMs.
>> >> > Easy to implement and store/retrieve dictionary values.
>> >> > It is a pure Java implementation.
>> >> > There is no master/slave concept and no single point of failure.
>> >> >
>> >> > Cons: Performance needs to be evaluated.
>> >> >
>> >> > *Redis distributed map : *
>> >> > It is also an in-memory map, but it is written in C, so we would need
>> >> > Java client libraries to interact with Redis. Need to maintain a
>> >> > separate cluster for it. It can also partition the data.
>> >> > Pros : More feature-rich than Hazelcast.
>> >> > Easy to implement and store/retrieve dictionary values.
>> >> > Cons : Need to maintain a separate cluster for the global
>> >> > dictionary.
>> >> > May not be suitable for a big data stack.
>> >> > It is BSD licensed (not sure whether we can use it or not).
>> >> > Online performance figures say it is a little slower than Hazelcast.
>> >> >
>> >> > Please let me know which would be the best fit for our loading solution,
>> >> > and please add any other suitable solution I may have missed.
>> >> > --
>> >> > Thanks & Regards,
>> >> > Ravi
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks & Regards,
>> > Ravi
>>
>>
>>
>>
>>
>>
>
>