Apache CarbonData Dev Mailing List archive - Re: Discussion(New feature) regarding single pass data loading solution.

Posted by Aniket Adnaik on Oct 12, 2016; 10:20pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1820.html

Hi Ravi,

1. I agree with Jihong that creation of the global dictionary should be
optional, so that it can be disabled to improve load performance. Users
should be made aware that using a global dictionary may boost query
performance.
2. We should have a generic interface to manage the global dictionary when
it comes from external sources. In general, it is not a good idea to depend
on too many external tools.
3. Maybe we should allow the user to generate the global dictionary
separately through a SQL command or similar, something like a materialized
view. This means carbon should avoid using the local dictionary and do late
materialization when a global dictionary is present.
4. Maybe we should think of some way to create the global dictionary lazily
as we serve SELECT queries. The implementation may not be that
straightforward. Not sure if it's worth the effort.
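
To make points 1 to 3 concrete, here is a minimal sketch in Python, not
CarbonData code. The names DictionarySource, ExternalDictionary and
encode_column are hypothetical. It assumes that a load uses the global
dictionary only when it covers every value in the column, and otherwise
falls back to a per-file local dictionary. That fallback rule is an
assumption made for illustration.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class DictionarySource(Protocol):
    """Hypothetical generic interface: maps a column value to a surrogate key."""

    def lookup(self, column: str, value: str) -> Optional[int]:
        ...


class ExternalDictionary:
    """A global dictionary produced outside the load, e.g. by a separate command."""

    def __init__(self, mappings: Dict[str, Dict[str, int]]) -> None:
        self._mappings = mappings

    def lookup(self, column: str, value: str) -> Optional[int]:
        # Returns None when the column or the value is unknown.
        return self._mappings.get(column, {}).get(value)


def encode_column(
    column: str,
    values: List[str],
    global_dict: Optional[DictionarySource] = None,
) -> Tuple[str, List[int], Dict[str, int]]:
    """Return (kind, codes, local_dictionary) for one column of one file."""
    if global_dict is not None:
        codes = [global_dict.lookup(column, v) for v in values]
        # Assumption: use the global dictionary only if it covers every value.
        if all(code is not None for code in codes):
            return "global", codes, {}
    # Fallback: build a local dictionary in first-seen order, starting at 0.
    local: Dict[str, int] = {}
    codes = [local.setdefault(v, len(local)) for v in values]
    return "local", codes, local
```

With this shape, a dictionary coming from any external source only has to
implement lookup, and a load that is given no global dictionary still
succeeds with local encoding.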

Best Regards,
Aniket

On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <[hidden email]> wrote:

>
> A rather straightforward option is to allow the user to supply a global
> dictionary generated somewhere else, or we build a separate tool just for
> generating as well as updating the dictionary. Then the normal data loading
> process will encode columns with a local dictionary if none is supplied.
> This should cover the majority of cases for low-to-medium cardinality
> columns. For the cases where we have to incorporate online dictionary
> updates, using a lock mechanism to sync up should serve the purpose.
>
> In other words, generating the global dictionary is an optional step, only
> triggered when needed, not a default step as it is currently.
>
> Jihong
>
> -----Original Message-----
> From: Ravindra Pesala [mailto:[hidden email]]
> Sent: Tuesday, October 11, 2016 2:33 AM
> To: dev
> Subject: Discussion(New feature) regarding single pass data loading
> solution.
>
> Hi All,
>
> This discussion is regarding a single pass data load solution.
>
> Currently data is loaded into carbon in 2 passes/jobs:
> 1. Generating the global dictionary using a Spark job.
> 2. Encoding the data with dictionary values and creating carbondata files.
> This 2 pass solution has many disadvantages: it needs to read the data
> twice in the case of CSV file input, or it needs to execute the dataframe
> twice if data is loaded from a dataframe.
>
> In order to overcome the above issues of 2 pass data loading, we can have
> single pass data loading, and the following are the alternative solutions.
>
> Use local dictionary
> Use a local dictionary for each carbondata file while loading data, but it
> may lead to query performance degradation and a larger memory footprint.
>
> Use KV store/distributed map.
> *HBase/Cassandra cluster : *
> Dictionary data would be stored in the KV store, which generates the
> dictionary value if it is not already present. We all know the pros/cons of
> HBase but the following are a few.
> Pros : These are Apache licensed.
> Easy to implement to store/retrieve dictionary values.
> Performance needs to be evaluated.
>
> Cons : Need to maintain a separate cluster for maintaining the global
> dictionary.
>
> *Hazelcast distributed map : *
> Dictionary data could be saved in the distributed concurrent hash map of
> Hazelcast. It is an in-memory map, partitioned as per the number of nodes.
> We can even maintain backups using the sync/async functionality to avoid
> data loss when an instance is down. We do not need to maintain a separate
> cluster for it as it can run on the executor JVM itself.
> Pros: It is Apache licensed.
> No need to maintain a separate cluster as instances can run in
> executor JVMs.
> Easy to implement and store/retrieve dictionary values.
> It is a pure Java implementation.
> There is no master/slave concept and no single point of failure.
>
> Cons: Performance needs to be evaluated.
>
> *Redis distributed map : *
> It is also an in-memory map but it is written in C, so we would need
> Java client libraries to interact with Redis. Need to maintain a
> separate cluster for it. It can also partition the data.
> Pros : More feature rich than Hazelcast.
> Easy to implement and store/retrieve dictionary values.
> Cons : Need to maintain a separate cluster for maintaining the global
> dictionary.
> May not be suitable for a big data stack.
> It is BSD licensed (not sure whether we can use it or not).
> Online performance figures say it is a little slower than Hazelcast.
>
> Please let me know which would be the best fit for our loading solution.
> And please add any other suitable solution if I missed one.
> --
> Thanks & Regards,
> Ravi
>
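
The difference between the current 2 pass load and the proposed single pass
load can be sketched in Python as follows. This is only an illustration
under stated assumptions, not CarbonData code. SharedDictionary is a
hypothetical in-process stand-in for the KV store or distributed map
discussed above, guarded by a lock in the spirit of the lock mechanism
Jihong mentions. read_rows is an assumed callable that re-reads the input
each time it is called. Surrogate keys start at 0 purely for convenience.

```python
import threading
from typing import Callable, Dict, Iterator, List


class SharedDictionary:
    """Hypothetical stand-in for a KV store / distributed map of dictionary values."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._codes: Dict[str, int] = {}

    def get_or_create(self, value: str) -> int:
        # Look up the surrogate key, generating one if the value is new.
        with self._lock:
            code = self._codes.get(value)
            if code is None:
                code = len(self._codes)
                self._codes[value] = code
            return code


def load_two_pass(read_rows: Callable[[], Iterator[str]]) -> List[int]:
    """Current approach: read once to build the dictionary, again to encode."""
    dictionary: Dict[str, int] = {}
    for value in read_rows():  # pass 1: generate the global dictionary
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return [dictionary[value] for value in read_rows()]  # pass 2: encode


def load_single_pass(
    read_rows: Callable[[], Iterator[str]], shared: SharedDictionary
) -> List[int]:
    """Proposed approach: read once, resolving keys against a shared map."""
    return [shared.get_or_create(value) for value in read_rows()]
```

Both functions produce the same encoding for the same input, but the
single pass version calls read_rows once instead of twice. In a real
deployment the cost moves to the per-value lookup against the shared
store, which is why the performance of each candidate would need to be
evaluated.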