Discussion(New feature) regarding single pass data loading solution.


Discussion(New feature) regarding single pass data loading solution.

ravipesala
Hi All,

This discussion is regarding a single-pass data load solution.

Currently, data is loaded into Carbon in two passes/jobs:
 1. Generate the global dictionary using a Spark job.
 2. Encode the data with the dictionary values and create the CarbonData files.
This two-pass solution has several disadvantages: it needs to read the data
twice in the case of CSV input, or execute the dataframe twice if the data
is loaded from a dataframe.
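To make the double read concrete, here is a hedged, simplified Spark sketch (not the actual CarbonData load path; the file path and column name are illustrative). Each action triggers a separate scan of the same input:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Simplified sketch of the two-pass cost: the same CSV input is
// scanned once to collect distinct values (dictionary job) and a
// second time to encode/write the rows.
public class TwoPassIllustration {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("two-pass-illustration").getOrCreate();
        Dataset<Row> input = spark.read().csv("hdfs:///data/input.csv");
        // Pass 1: dictionary generation scans every row.
        long distinctValues = input.select("_c0").distinct().count();
        // Pass 2: encoding/writing scans every row again.
        long totalRows = input.count();
        System.out.println(distinctValues + " distinct of " + totalRows + " rows");
        spark.stop();
    }
}
```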

To overcome the above issues of two-pass data loading, we can do
single-pass data loading; the following are the alternative solutions.

Use local dictionary
 Use a local dictionary for each CarbonData file while loading the data;
however, this may degrade query performance and increase the memory footprint.
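As a minimal sketch of what a per-file local dictionary looks like (an illustrative helper, not CarbonData's actual encoder):

```java
import java.util.HashMap;
import java.util.Map;

// Each data file gets its own value -> surrogate-key map, so no
// cross-loader coordination is needed; the trade-off is that every
// file carries its own dictionary and queries must decode per file.
class LocalDictionary {
    private final Map<String, Integer> valueToKey = new HashMap<>();

    int encode(String value) {
        Integer key = valueToKey.get(value);
        if (key == null) {
            key = valueToKey.size() + 1; // assign the next surrogate key
            valueToKey.put(value, key);
        }
        return key;
    }
}
```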

Use KV store/distributed map.
*HBase/Cassandra cluster : *
  Dictionary data would be stored in the KV store, and a dictionary value is
generated if it is not already present. We all know the pros and cons of
HBase, but the following are a few.
  Pros : These are Apache licensed.
         Easy to implement storing/retrieving dictionary values.

  Cons : Need to maintain a separate cluster for the global dictionary.
         Performance needs to be evaluated.
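For illustration, a hedged sketch of the generate-if-absent pattern on HBase (the table layout, column family, and counter row are assumptions, not an agreed design):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Assumed layout: one dictionary table per column; row key = column
// value, one cell holding the surrogate key, plus a counter row.
class HBaseDictionary {
    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] Q = Bytes.toBytes("id");
    private static final byte[] COUNTER_ROW = Bytes.toBytes("!counter");

    long lookupOrGenerate(Table dict, byte[] value) throws IOException {
        Result existing = dict.get(new Get(value));
        if (!existing.isEmpty()) {
            return Bytes.toLong(existing.getValue(CF, Q));
        }
        // Reserve a candidate key from the shared atomic counter.
        long candidate = dict.incrementColumnValue(COUNTER_ROW, CF, Q, 1L);
        Put put = new Put(value).addColumn(CF, Q, Bytes.toBytes(candidate));
        // checkAndPut with a null expected value = put only if absent,
        // so exactly one concurrent loader wins the race.
        if (dict.checkAndPut(value, CF, Q, null, put)) {
            return candidate;
        }
        // Lost the race: read back the key the winner stored.
        return Bytes.toLong(dict.get(new Get(value)).getValue(CF, Q));
    }
}
```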

*Hazelcast distributed map : *
  Dictionary data could be saved in Hazelcast's distributed concurrent hash
map. It is an in-memory map, partitioned across the nodes, and we can
maintain backups using its sync/async functionality to avoid data loss when
an instance goes down. We do not need to maintain a separate cluster for
it, as it can run inside the executor JVMs themselves.
  Pros: It is Apache licensed.
        No need to maintain a separate cluster, as instances can run in
the executor JVMs.
        Easy to implement storing/retrieving dictionary values.
        It is a pure Java implementation.
        There is no master/slave concept and no single point of failure.

  Cons: Performance needs to be evaluated.
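A hedged sketch of the embedded-member idea (map and counter names are illustrative; API is Hazelcast 3.x style):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;
import com.hazelcast.core.IMap;

public class HazelcastDictionaryDemo {
    public static void main(String[] args) {
        // Embedded member: could run inside each executor JVM, so no
        // separate cluster is needed; members discover each other.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Long> dict = hz.getMap("dict-city");
        IAtomicLong counter = hz.getAtomicLong("dict-city-counter");

        String value = "shenzhen";
        Long key = dict.get(value);
        if (key == null) {
            long candidate = counter.incrementAndGet();       // reserve a key
            Long winner = dict.putIfAbsent(value, candidate); // atomic cluster-wide
            key = (winner != null) ? winner : candidate;      // loser adopts winner's key
        }
        System.out.println(value + " -> " + key);
        hz.shutdown();
    }
}
```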

*Redis distributed map : *
    It is also an in-memory map, but it is written in C, so we would need
Java client libraries to interact with Redis. A separate cluster must be
maintained for it. It can also partition the data.
  Pros : More feature-rich than Hazelcast.
         Easy to implement storing/retrieving dictionary values.
  Cons : Need to maintain a separate cluster for the global dictionary.
         May not be suitable for the big data stack.
         It is BSD licensed (not sure whether we can use it or not).
  Published performance figures suggest it is a little slower than Hazelcast.
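A hedged sketch using the Jedis Java client (key names are illustrative): HSETNX gives an atomic put-if-absent per hash field, and INCR serves as the key counter.

```java
import redis.clients.jedis.Jedis;

public class RedisDictionaryDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String value = "shenzhen";
            String key = jedis.hget("dict:city", value);
            if (key == null) {
                long candidate = jedis.incr("dict:city:counter"); // reserve a key
                // hsetnx returns 1 if we stored it, 0 if another loader won.
                if (jedis.hsetnx("dict:city", value, Long.toString(candidate)) == 1) {
                    key = Long.toString(candidate);
                } else {
                    key = jedis.hget("dict:city", value); // adopt the winner's key
                }
            }
            System.out.println(value + " -> " + key);
        }
    }
}
```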

Please let me know which would best fit our loading solution, and please
add any other suitable solutions I may have missed.
--
Thanks & Regards,
Ravi

RE: Discussion(New feature) regarding single pass data loading solution.

Jihong Ma

A rather straightforward option is to allow the user to supply a global dictionary generated somewhere else, or to build a separate tool just for generating and updating the dictionary. The normal data loading process would then encode columns with a local dictionary if a global one is not supplied. This should cover the majority of cases for low-to-medium cardinality columns. For the cases where we have to incorporate online dictionary updates, a lock mechanism to sync up should serve the purpose.

In other words, generating the global dictionary becomes an optional step, triggered only when needed, not a default step as it is today.
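As a rough, single-JVM illustration of that lock-based sync-up (a real loader spanning processes would need a distributed lock, e.g. via ZooKeeper or an HDFS lock file; all names here are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Reads stay lock-free; only the occasional new-value insert takes
// the lock, matching the "occasional online update" assumption.
class LockingDictionary {
    private final Map<String, Integer> dict = new ConcurrentHashMap<>();
    private final ReentrantLock writeLock = new ReentrantLock();
    private int nextKey = 0;

    int lookupOrAdd(String value) {
        Integer key = dict.get(value); // fast path: value already known
        if (key != null) {
            return key;
        }
        writeLock.lock(); // slow path: serialize dictionary writers
        try {
            key = dict.get(value); // re-check after acquiring the lock
            if (key == null) {
                key = ++nextKey;
                dict.put(value, key);
            }
            return key;
        } finally {
            writeLock.unlock();
        }
    }
}
```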

Jihong


Re: Discussion(New feature) regarding single pass data loading solution.

Aniket Adnaik
Hi Ravi,

1. I agree with Jihong that creation of the global dictionary should be
optional, so that it can be disabled to improve load performance. The user
should be made aware that using a global dictionary may boost query
performance.
2. We should have a generic interface to manage the global dictionary when
it comes from external sources. In general, it is not a good idea to depend
on too many external tools.
3. Maybe we should allow the user to generate the global dictionary
separately through a SQL command or similar, something like a materialized
view. This means Carbon should avoid using the local dictionary and do late
materialization when a global dictionary is present (a toy sketch of late
materialization follows this list).
4. Maybe we should think of ways to create the global dictionary lazily
as we serve SELECT queries. The implementation may not be that
straightforward; I am not sure it is worth the effort.
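To make point 3's late materialization concrete, a hedged toy sketch (the values and dictionary are made up): the engine works on surrogate keys throughout and decodes to strings only when producing the final result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LateDecodeDemo {
    public static void main(String[] args) {
        Map<Integer, String> reverseDict = new HashMap<>();
        reverseDict.put(1, "beijing");
        reverseDict.put(2, "shenzhen");

        List<Integer> encodedColumn = Arrays.asList(1, 2, 2, 1, 2);
        // Filtering happens on cheap integer comparisons...
        List<Integer> matched = new ArrayList<>();
        for (int key : encodedColumn) {
            if (key == 2) {
                matched.add(key);
            }
        }
        // ...and decoding to strings only at the very end.
        for (int key : matched) {
            System.out.println(reverseDict.get(key));
        }
    }
}
```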

Best Regards,
Aniket



Re: Discussion(New feature) regarding single pass data loading solution.

Qingqing Zhou
In reply to this post by ravipesala
On Tue, Oct 11, 2016 at 2:32 AM, Ravindra Pesala <[hidden email]>
wrote:
> Currently data is loading to carbon in 2 pass/jobs
>  1. Generating global dictionary using spark job.

Do we have local dictionaries? If not, what if a column has many
distinct values - will the big dictionary be loaded into memory?

Regards,
Qingqing

Re: Discussion(New feature) regarding single pass data loading solution.

ravipesala
In reply to this post by Aniket Adnaik
Hi Jihong/Aniket,

In the current implementation of CarbonData, we already handle an external
dictionary while loading the data.
But here the question is: what should the default implementation be? Load
data without a dictionary?


Regards,
Ravi


RE: Discussion(New feature) regarding single pass data loading solution.

Jihong Ma
>>>> the question is: what would be the default implementation? Load data without a dictionary?

My thought is that we can provide a tool to generate the global dictionary using a sample data set, so that the initial global dictionaries are available before normal data loading. We should be able to perform encoding based on that; we only need to handle occasionally adding entries while loading. For columns specified with global dictionary encoding whose dictionary is not in place before data loading, we error out and direct the user to use the tool first.
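A hedged sketch of what such a one-time tool might look like on Spark (the paths, column name, and output format are all assumptions, not a real CarbonData interface):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// One pass over a sample data set: collect the distinct values of a
// dictionary column and write them out with generated surrogate keys.
public class DictionaryBootstrapTool {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dictionary-bootstrap").getOrCreate();
        Dataset<Row> sample = spark.read()
                .option("header", "true")
                .csv("hdfs:///tmp/sample.csv");
        sample.select("city").distinct()
              .javaRDD().map(row -> row.getString(0))
              .zipWithIndex()                        // value -> running index
              .map(pair -> pair._1() + "," + (pair._2() + 1))
              .saveAsTextFile("hdfs:///tmp/dict/city");
        spark.stop();
    }
}
```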

Make sense?

Jihong


RE: Discussion(New feature) regarding single pass data loading solution.

Liang Chen
Hi Jihong,

I am not sure users will accept using an extra tool for this work, because providing a tool, or doing a scan at the first load of each table to build most of the global dictionary, is the same cost from the user's perspective, and maintaining the dictionary file is also the same cost. Users always expect the system to generate the dictionary file automatically and internally during data loading.

Can we consider this:
first load: scan to generate most of the global dictionary file, then copy this file to each load node for subsequent loads.

Regards
Liang


Re: Discussion(New feature) regarding single pass data loading solution.

Aniket Adnaik
I have the following comments:

1. If an external dictionary is provided, we accept it. This interface
should be generic enough that we can perform lookup, add, delete, create,
and drop functionality (a sketch of such an interface follows this list). I
believe we already have this functionality to some extent. As long as we
are able to maintain the dictionary, it should be fine.
2. If an external dictionary is not provided, then by default we should
build it internally, which is our current behavior. This will continue to
impact load performance, though.
3. If load performance is not acceptable, we should allow the user to
disable building of the global dictionary; Carbon should build a local
dictionary instead. Will this setting apply to all subsequent loads? Maybe
yes, for now.
4. If the user decides to build a dictionary at a later point, either via
an external tool or using a Carbon SQL command ("CREATE DICTIONARY
TABLE..."), we should provide that facility. This will help the user
improve query performance through late materialization. The local
dictionary will not be used in this case. Subsequent loads will continue to
add new entries to this new dictionary (external or Carbon-specific).
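A hedged sketch of the kind of generic interface point 1 suggests (method names and signatures are illustrative only, not an existing CarbonData API):

```java
// Generic management interface for an externally supplied global
// dictionary: lifecycle (create/drop) plus entry-level operations.
public interface DictionaryService {
    void create(String table, String column);
    void drop(String table, String column);

    /** Returns the surrogate key for a value, or null if absent. */
    Integer lookup(String table, String column, String value);

    /** Adds a value if absent and returns its surrogate key. */
    int add(String table, String column, String value);

    void delete(String table, String column, String value);
}
```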

This doesn't really solve our double-pass problem, but it works around it
by moving the dictionary-building operation out of the critical path.


Best Regards,
Aniket



Re: Discussion(New feature) regarding single pass data loading solution.

Aniket Adnaik
After rethinking point 4 in my previous email: it would be very expensive
to rebuild and re-encode the values, so it may not be a viable option; only
future loads could benefit from it. We would also end up with some segments
using the global dictionary and some using a local dictionary. Maybe we
should not consider this option.

Best Regards,
Aniket



RE: Discussion(New feature) regarding single pass data loading solution.

Jihong Ma
In reply to this post by Liang Chen
Hi Liang,

This tool is more or less like the first load: run once after the table is created, then any subsequent/incremental loads proceed normally and can update the global dictionary when they encounter a new value. This is the easiest way of achieving a one-pass data loading process without too much overhead.

Since this tool is only triggered once per table, it is not too much of a burden on end users. Keeping global dictionary generation out of the way of regular data loading is the key here.

Jihong


Re: Discussion(New feature) regarding single pass data loading solution.

ravipesala
In reply to this post by Aniket Adnaik
Hi,

1. Using an external tool to generate the dictionary: I think this cannot
be the default solution; it is just an option for users who are willing to
generate the dictionary separately and provide it to Carbon while loading
the data, to boost performance.

2. Using the 2-pass solution (current solution): currently we have the
2-pass solution, and this becomes a bottleneck for CarbonOutputFormat;
issues also arise when we use dataframe.write().

3. Using a local dictionary as the default implementation: we could choose
this solution, but it hurts query performance because late dictionary
decoding cannot work.

4. Using a distributed map as the default implementation: generate the
global dictionary using a distributed-map solution, but loading performance
needs to be evaluated.

Regards,
Ravi.


Re: Discussion(New feature) regarding single pass data loading solution.

ravipesala
In reply to this post by Jihong Ma
Hi Jihong,

I agree that we can use an external tool for the first load, but for
incremental loads we should have a solution to add to the global
dictionary. That solution should also be enough to generate the global
dictionary even if the user does not use the external tool the first time,
and it could be a distributed map or a KV store.

Regards,
Ravi.


Re: Discussion(New feature) regarding single pass data loading solution.

Vimal Das Kammath
The global dictionary is the key feature behind Carbon's impressive query
performance: it enables late materialisation, which lets carbon execute
queries using less memory and CPU. It also indirectly helps carbon perform
better in concurrent query scenarios, as any block can be processed by any
node without having to load a local dictionary.

I agree that the 2 pass approach is not the most optimal when it comes to
load performance. I see that we have a lot of good alternatives suggested
in this discussion. We need to quantitatively evaluate each of the
approaches to come to a conclusion.

1) Support for completely local dictionaries:- This will surely avoid the 2
pass issue. I think that we can have this as an option, but it need not be
the default, because the benefit in query performance that we get from the
global dictionary far outweighs the performance overhead during data load.
We can check current and future customer scenarios to validate whether
providing this option will benefit any of them. In that case we can
implement this as an optional flag during table creation.

2) Support for External Dictionary:- The current approach copies the
externally supplied dictionary into Carbon's global dictionary. Aniket's
suggestion of supporting a completely external dictionary through an
interface is a good one (see the sketch after this list). I guess
Ravindra's suggestion of using external k/v stores or distributed maps can
be implemented behind this interface. But we need to test the performance
of the various distributed maps/key-value stores and decide whether this is
a viable option, because if this approach is slower than the 2 pass
approach, it won't make sense to invest in it. However, I support the idea
of having an external dictionary interface.

3) External tool to generate dictionary:- My opinion is that it is just
de-coupling the first pass and moving it outside the data load process.
From the user's perspective, they still need to run the tool first to
generate the dictionary before loading data. Our 2 pass approach just
automates this.
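
As a concrete starting point for the interface mentioned in point 2, here
is a hedged sketch; all names are illustrative, not carbon's actual API.
Zookeeper + HDFS, Hazelcast, or an external KV store would each be one
implementation behind it:

// Hypothetical interface for pluggable global dictionaries; the names are
// illustrative, not CarbonData's actual API.
public interface GlobalDictionaryService {

    // Create/drop the dictionary store for one column of a table.
    void create(String tableName, String columnName);
    void drop(String tableName, String columnName);

    // Look up the surrogate key for a value; null if the value is absent.
    Integer lookup(String columnValue);

    // Assign (and durably record) a surrogate key for a new value.
    int add(String columnValue);
}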

Regards
Vimal


RE: Discussion(New feature) regarding single pass data loading solution.

Jihong Ma
In reply to this post by ravipesala
Hi Ravi,

The major concern I have with generating the global dictionary from scratch
in a single scan is performance. Handling an occasional update to the
dictionary is far simpler and more cost effective, in terms both of
synchronization cost and of refreshing the global/local cache copy.

There is a lot to worry about with a distributed map, and leveraging a KV
store is overkill if it is used simply for dictionary generation.
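
To make the cost argument concrete, here is a minimal sketch of that
read-mostly pattern, assuming a hypothetical persistNewEntry hook that
stands in for whatever shared store holds the dictionary: lookups hit an
immutable local snapshot with no synchronization, and only a genuinely new
value takes the lock and refreshes the cache.

import java.util.HashMap;
import java.util.Map;

// Sketch of the "occasional update" pattern: lock-free lookups against a
// local snapshot, with a lock taken only on the rare new-value path.
public abstract class IncrementalDictionary {
    private volatile Map<String, Integer> snapshot = new HashMap<>();
    private final Object updateLock = new Object();

    public int encode(String value) {
        Integer key = snapshot.get(value);
        if (key != null) {
            return key;                       // common case: no synchronization
        }
        synchronized (updateLock) {           // rare case: new dictionary value
            key = snapshot.get(value);        // re-check: another thread may have won
            if (key == null) {
                key = persistNewEntry(value); // hypothetical: record in shared store
                Map<String, Integer> copy = new HashMap<>(snapshot);
                copy.put(value, key);
                snapshot = copy;              // publish the refreshed local cache
            }
            return key;
        }
    }

    // Hypothetical hook: durably assign a key in the shared global dictionary.
    protected abstract int persistNewEntry(String value);
}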

Regards.

Jihong


Re: Discussion(New feature) regarding single pass data loading solution.

Jacky Li
Hi,

I can offer one more approach for this discussion. Since new dictionary
values are rare in the case of incremental loads (ensure the first load
covers as many dictionary values as possible), synchronization should be
rare. So how about using Zookeeper + an HDFS file to provide this service?
This is what carbon is doing today; we can wrap Zookeeper + HDFS behind the
global dictionary interface (a rough sketch follows the list below).
It has the benefit of
1. automated: without bothering the user
2. not introducing more dependency: we are already using zookeeper and HDFS.
3. performance? since new dictionary values and synchronization are rare.
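
A rough sketch of what the wrapper could look like, using Apache Curator
for the Zookeeper lock and a plain HDFS append. The lock path and file name
are illustrative only, it assumes HDFS append is enabled, and in practice
the Curator client would be long-lived rather than created per entry; this
only shows the locking shape.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: serialize rare dictionary appends across loader JVMs with a
// Zookeeper lock, keeping the dictionary itself in an HDFS file.
public class ZkHdfsDictionary {

    public static void appendEntry(String zkQuorum, String dictFile, String newValue)
            throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory
                .newClient(zkQuorum, new ExponentialBackoffRetry(1000, 3));
        zk.start();
        // One lock node per column dictionary; loaders contend here only
        // when they actually see a new value.
        InterProcessMutex lock = new InterProcessMutex(zk, "/carbon/dict-locks/col1");
        lock.acquire();
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataOutputStream out = fs.append(new Path(dictFile))) {
                out.writeUTF(newValue); // readers re-read the file to refresh caches
            }
        } finally {
            lock.release();
            zk.close();
        }
    }
}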

What do you think?

Regards,
Jacky


Re: Discussion(New feature) regarding single pass data loading solution.

ravipesala
Hi Jacky/Jihong,

I agree that new dictionary values are fewer in the case of incremental
data loads, but that depends entirely on the user's data scenarios. In some
scenarios there may be many new dictionary values; we cannot rule that out.
Also, for users' convenience we should provide a single pass solution
without insisting that they run an external tool first. We can still
provide the option to run the external tool first and supply the dictionary
to improve performance.

My opinion is that it is better to use a professional distributed map like
Hazelcast than Zookeeper + HDFS. It is lightweight and does not require a
separate cluster; it can form the cluster within the executor JVMs. Maybe
we can have a try; after all, it will be just one interface implementation
for dictionary generation. We can have multiple implementations and then
decide based on optimal performance. A rough sketch against Hazelcast's API
is below.
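
For example, a hedged sketch of one such implementation against Hazelcast's
3.x API; the map and counter names are illustrative, and the embedded
member joins the cluster formed across the executor JVMs:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;
import com.hazelcast.core.IMap;

// Sketch: dictionary generation on an embedded Hazelcast member, so no
// separate cluster has to be maintained.
public class HazelcastDictionary {
    private final IMap<String, Integer> dict;
    private final IAtomicLong keyGenerator;

    public HazelcastDictionary(String column) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        this.dict = hz.getMap("dict-" + column);
        this.keyGenerator = hz.getAtomicLong("dict-key-" + column);
    }

    public int getOrGenerate(String value) {
        Integer key = dict.get(value);
        if (key != null) {
            return key;
        }
        int candidate = (int) keyGenerator.incrementAndGet(); // cluster-wide counter
        Integer winner = dict.putIfAbsent(value, candidate);
        // A lost race just leaves a gap in the surrogate key space.
        return winner != null ? winner : candidate;
    }
}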

Regards,
Ravi

On 15 October 2016 at 10:50, Jacky Li <[hidden email]> wrote:

> Hi,
>
> I can offer one more approach for this discussion, since new dictionary
> values are rare in case of incremental load (ensure first load having as
> much dictionary value as possible), so synchronization should be rare. So
> how about using Zookeeper + HDFS file to provide this service. This is what
> carbon is doing today, we can wrap Zookeeper + HDFS to provide the global
> dictionary interface.
> It has the benefit of
> 1. automated: without bordering the user
> 2. not introducing more dependency: we already using zookeeper and HDFS.
> 3. performance? since new dictionary value and synchronization is rare.
>
> What do you think?
>
> Regards,
> Jacky
>
> > 在 2016年10月15日,上午2:38,Jihong Ma <[hidden email]> 写道:
> >
> > Hi Ravi,
> >
> > The major concern I have for generating global dictionary from scratch
> with a single scan is performance, the way to handle an occasional update
> to the dictionary is way simpler and cost effective in terms of
> synchronization cost and refresh the global/local cache copy.
> >
> > There are a lot to worry about for distributed map, and leveraging KV
> store is overkill if simply just for dictionary generation.
> >
> > Regards.
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Ravindra Pesala [mailto:[hidden email]]
> > Sent: Friday, October 14, 2016 11:03 AM
> > To: dev
> > Subject: Re: Discussion(New feature) regarding single pass data loading
> solution.
> >
> > Hi Jihong,
> >
> > I agree, we can use external tool for first load, but for incremental
> load
> > we should have solution to add global dictionary. So this solution should
> > be enough to generate global dictionary even if user does not use
> external
> > tool for first time. That solution could be distributed map or KV store.
> >
> > Regards,
> > Ravi.
> >
> > On 14 October 2016 at 23:12, Jihong Ma <[hidden email]> wrote:
> >
> >> Hi Liang,
> >>
> >> This tool is more or less like the first load, the first time after
> table
> >> is created, any subsequent loads/incremental loads will proceed and is
> >> capable of updating the global dictionary when it encounters new value,
> >> this is easiest way of achieving 1 pass data loading process without too
> >> much overhead.
> >>
> >> Since this tool is only triggered once per table, not considered too
> much
> >> burden on the end users. Making global dictionary generation out of the
> way
> >> of regular data loading is the key here.
> >>
> >> Jihong
> >>
> >> -----Original Message-----
> >> From: Liang Chen [mailto:[hidden email]]
> >> Sent: Thursday, October 13, 2016 5:39 PM
> >> To: [hidden email]
> >> Subject: RE: Discussion(New feature) regarding single pass data loading
> >> solution.
> >>
> >> Hi jihong
> >>
> >> I am not sure that users can accept to use extra tool to do this work,
> >> because provide tool or do scan at first time per table for most of
> global
> >> dict are same cost from users perspective, and maintain the dict file
> also
> >> be same cost, they always expecting that system can automatically and
> >> internally generate dict file during loading data.
> >>
> >> Can we consider this:
> >> first load: make scan to generate most of global dict file, then copy
> this
> >> file to each load node for subsequent loading
> >>
> >> Regards
> >> Liang


--
Thanks & Regards,
Ravi

Re: Discussion(New feature) regarding single pass data loading solution.

Liang Chen
+1 for this:
-------------------------
Maybe we can have a try; after all, it will be just one interface
implementation for dictionary generation. We can have multiple
implementations and then decide based on performance.
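
A sketch of how small that single interface could be (the names below are
illustrative, not existing code):

    // Pluggable dictionary-generation contract: the load path codes against
    // this, and Hazelcast / Zookeeper+HDFS / KV-store variants plug in behind it.
    public interface DictionaryGenerator {
      /** Returns the dictionary id for the value, creating one if unseen. */
      int getOrGenerateKey(String columnName, String value);

      /** Flushes any newly generated entries to persistent storage. */
      void flush();
    }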

Regards
Liang


RE: Discussion(New feature) regarding single pass data loading solution.

Jihong Ma
In reply to this post by ravipesala
Hi Ravi,

I took a quick look at Hazelcast. What it offers is a map distributed across the cluster (any single node stores only a portion of the map), whereas to facilitate parallel data loading I think we need a complete copy on each node. Is this the structure we are looking for?

It does allow in-memory backups of the map in case a node goes down. For persistence, it allows storing the map to a db, but that requires implementing their API to hook the two up. Async/sync modes are supported with no guarantee in terms of consistency, unless we go further into their transaction support; 2-phase commit/XA is offered with read-committed isolation, and achieving that is quite complicated when we need to ensure ACID on changes to the map. I suggest you investigate further to understand the implication and effort.
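
For reference, hooking the map up to a db means implementing Hazelcast's
MapStore interface, roughly like this (a sketch only; the class name and
storage details are placeholders):

    import com.hazelcast.core.MapStore;

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    // The persistence hook Hazelcast invokes on every mutation of the map;
    // keeping this store consistent with the in-memory copy is our problem.
    public class DictionaryMapStore implements MapStore<String, Integer> {
      @Override public void store(String key, Integer value) {
        // write one dictionary entry through to the backing store (e.g. JDBC)
      }
      @Override public void storeAll(Map<String, Integer> entries) {
        entries.forEach(this::store);
      }
      @Override public void delete(String key) {
        // dictionary entries are never deleted, so this can stay a no-op
      }
      @Override public void deleteAll(Collection<String> keys) { }
      @Override public Integer load(String key) {
        return null; // single-entry lookup from the backing store on a miss
      }
      @Override public Map<String, Integer> loadAll(Collection<String> keys) {
        return new HashMap<>(); // bulk lookup, e.g. when warming the map
      }
      @Override public Iterable<String> loadAllKeys() {
        return null; // returning null disables eager pre-loading
      }
    }

Keeping that write-through consistent with backups and concurrent puts is
exactly the effort worth sizing up.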

We all understand we cannot afford any inconsistency in the dictionary; otherwise we cannot decode the data back correctly. Correctness is even more critical than performance.


Jihong

-----Original Message-----
From: Ravindra Pesala [mailto:[hidden email]]
Sent: Saturday, October 15, 2016 12:50 AM
To: dev
Subject: Re: Discussion(New feature) regarding single pass data loading solution.

Hi Jacky/Jihong,

I agree that new dictionary values are few in the case of incremental data
loads, but that depends entirely on the user's data. In some scenarios new
dictionary values may be many; we cannot rule that out. Also, for users'
convenience we should provide a single-pass solution without insisting that
they run an external tool first. We can still offer the option of running
the external tool and supplying a dictionary up front to improve
performance.

My opinion is that it is better to use a professional distributed map like
Hazelcast than Zookeeper + HDFS. It is lightweight and does not require a
separate cluster; it can form the cluster within the executor JVMs.
Maybe we can have a try; after all, it will be just one interface
implementation for dictionary generation. We can have multiple
implementations and then decide based on performance.

Regards,
Ravi


Re: Discussion(New feature) regarding single pass data loading solution.

ravipesala
Hi Jihong,

Yes, Hazelcast keeps only part of the data on each node because it splits
the map into partitions and assigns partition ownership to the nodes. But
if the requested data is not present on a node, it can still fetch it from
the owning partition elsewhere in the cluster. In any case, we can maintain
a local cache of the full dictionary on each node: we look up Hazelcast
only when a key is missing from the local cache, and update the local cache
once the value is retrieved.

Yes, it allows backing up data on multiple nodes (as configured) for high
availability. Backups are done in sync or async mode, and consistency is
guaranteed if we use sync-mode backups, because a put of a key-value pair
into a Hazelcast map blocks the call until the entry is copied to all
backup nodes in memory. Hazelcast maps also support locks to ensure data
consistency; we can use APIs like map.putIfAbsent or map.lock & map.unlock.
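
A minimal sketch of that lookup path (assuming Hazelcast 3.x; the class
name, map names and id scheme below are illustrative, not existing
CarbonData code):

    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IAtomicLong;
    import com.hazelcast.core.IMap;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Per-executor dictionary view: a local read cache in front of the
    // cluster-wide Hazelcast map, with putIfAbsent resolving races atomically.
    public class GlobalDictionarySketch {
      private final IMap<String, Integer> shared;   // partitioned across executor JVMs
      private final IAtomicLong nextId;             // cluster-wide id counter
      private final Map<String, Integer> localCache = new ConcurrentHashMap<>();

      public GlobalDictionarySketch(HazelcastInstance hz, String column) {
        this.shared = hz.getMap("dict-" + column);
        this.nextId = hz.getAtomicLong("dict-id-" + column);
      }

      /** Returns the dictionary id for a value, generating one if unseen. */
      public int getOrAssign(String value) {
        Integer id = localCache.get(value);        // 1. local lookup, no network hop
        if (id == null) {
          id = shared.get(value);                  // 2. cluster lookup
          if (id == null) {
            int candidate = (int) nextId.incrementAndGet();
            Integer winner = shared.putIfAbsent(value, candidate);  // 3. atomic insert
            id = (winner != null) ? winner : candidate;
          }
          localCache.put(value, id);               // refresh the local cache
        }
        return id;
      }
    }

If two loaders race on the same value, putIfAbsent guarantees a single
winner; the losing candidate id just becomes a gap in the id space, which
is harmless for encoding. Sync backups would be enabled on the map
configuration (e.g. MapConfig.setBackupCount(1)).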

Thanks,
Ravi.


--
Thanks & Regards,
Ravi

RE: Discussion(New feature) regarding single pass data loading solution.

Jihong Ma
Hi Ravi,

Making the in-memory backup copy consistent is only part of the story; we also need the on-disk backup (through the db persistence they support) to stay consistent with the in-memory copy. How do we achieve that? Probably the safest way is leveraging their transaction support. Please look into what they can vs. can't do, as well as the amount of effort/complexity required.
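
To make it concrete, leaning on their transaction support would look
roughly like this (a sketch against the Hazelcast 3.x API; the map name
and method are illustrative):

    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.TransactionalMap;
    import com.hazelcast.transaction.TransactionContext;
    import com.hazelcast.transaction.TransactionOptions;
    import com.hazelcast.transaction.TransactionOptions.TransactionType;

    // Wrap a dictionary update in a two-phase-commit transaction so the
    // in-memory change (and any store write-through) is applied atomically.
    public final class TransactionalDictUpdate {
      static void addEntry(HazelcastInstance hz, String value, int id) {
        TransactionOptions opts =
            new TransactionOptions().setTransactionType(TransactionType.TWO_PHASE);
        TransactionContext ctx = hz.newTransactionContext(opts);
        ctx.beginTransaction();
        try {
          TransactionalMap<String, Integer> dict = ctx.getMap("dict-column");
          if (dict.get(value) == null) {
            dict.put(value, id);
          }
          ctx.commitTransaction();  // prepare + commit across owner and backups
        } catch (RuntimeException e) {
          ctx.rollbackTransaction();
          throw e;
        }
      }
    }

Every dictionary mutation would then pay the two-phase-commit round trips,
which is exactly the cost/complexity trade-off to weigh.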

Jihong


