Hi Community,

As we are moving to CarbonData 2.0, in order to keep the project moving forward fast and stable, it is necessary to do some refactoring and clean up obsolete features before introducing new ones.

To that end, I propose making the following features obsolete and unsupported since 2.0. In my opinion, these features are seldom used.

1. Global dictionary
Since Spark 2.x, aggregation is much faster thanks to project Tungsten, so the global dictionary is not very useful any more, yet it makes data loading slow and needs very complex SQL plan transformation.

2. Bucket
The bucket feature of carbon is intended to improve join performance, but the actual improvement is very limited.

3. Carbon custom partition
Now that we have Hive standard partitioning, the old custom partition is not very useful.

4. BATCH_SORT
I have not seen anyone use this feature.

5.1 Page level inverted index
This is arguable. I understand that in one very specific scenario (when there are many columns in an IN filter) it has a benefit, but it slows down data loading and makes the encoding code very complex.

5.2 Old preaggregate and time series datamap implementation
As we have introduced MV, these two features can be dropped. And we can follow standard SQL with a new syntax to create an MV: CREATE MATERIALIZED VIEW.

6. Lucene datamap
This feature is not well implemented, as it reads too much index into memory, creating memory problems in most cases.

7. STORED BY
We should follow either the Hive syntax (STORED AS) or the SparkSQL syntax (USING).

And there is some internal refactoring we can do:
1. Unify dimension and measure
2. Keep the column order the same as the schema order
3. Spark integration refactoring based on the Spark extension interface
4. Store optimization (PR2729)

The aim of this proposal is to make the CarbonData code cleaner and reduce the community's maintenance effort. What do you think of it?

Regards,
Jacky

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
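[Editorial note: to make the dictionary trade-off in point 1 concrete, here is a toy Python sketch. It is not CarbonData code; the names are made up. It shows why a global dictionary helps aggregation (codes are consistent across all files, so grouping can run on integers) but complicates loading (every loader must agree on one shared dictionary).]

```python
# Toy contrast between a global dictionary (one code table shared by all
# files) and local dictionaries (one code table per file).

def encode(values, dictionary):
    """Dictionary-encode values, growing the dictionary as needed."""
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # loading cost: shared lookup/update
        codes.append(dictionary[v])
    return codes

file1 = ["china", "india", "china"]
file2 = ["india", "india", "usa"]

# Global dictionary: both files share one mapping, so equal values get
# equal codes everywhere and aggregation can group directly on integers.
global_dict = {}
g1 = encode(file1, global_dict)
g2 = encode(file2, global_dict)
assert g1[1] == g2[0]  # "india" has the same code in both files

# Local dictionaries: each file builds its own mapping; loading is
# independent, but codes are not comparable across files, so a
# cross-file aggregation must decode back to values first.
local1, local2 = {}, {}
l1 = encode(file1, local1)
l2 = encode(file2, local2)
print(g1, g2, l1, l2)
```

The shared-dictionary update in the loop is the part that makes loading slow in a distributed setting, which is the drawback the proposal cites.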
+1
-----
Best Regards
David Cai |
In reply to this post by Jacky Li
In my opinion, carbon 2.0 is the right time to clean up some unused features
to make the code cleaner and reduce maintenance effort.

+1 agree, -1 disagree, 0 other.
+1: 1, 2, 3, 5.1, 5.2, 7
0: 4
-1: 6, but it should be optimized. |
Hi,
Considering 5.2: in carbon 2.0, will MVs be "always in sync" like the carbon 1 pre-aggregate datamap, or will they require an action to be put back online after each update? Vertica, ClickHouse, Vector and some other first-class OLAP engines offer "always in sync" pre-aggregate views, which are very convenient.

Thanks,
Benoit

> On 29 Nov 2019, at 13:19, xubo245 <[hidden email]> wrote: |
Hi Benoit,
Thanks for pointing this out. Yes, it will behave like the carbon 1 preaggregate datamap. The MV implementation in CarbonData will check whether the MV is an aggregation on a single table; if so, it will be "always in sync" (a load of the MV table is automatically triggered after loading the main table). But if the MV involves a multiple-table join, it needs to be rebuilt manually.

Regards,
Jacky |
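[Editorial note: the "always in sync" behaviour described above for single-table aggregations can be sketched conceptually in Python. This is not the actual CarbonData MV code, just the idea: each load of the main table triggers an incremental merge into the aggregate table, so the MV never needs a manual rebuild.]

```python
from collections import defaultdict

class SingleTableAggMV:
    """Toy materialized view: SELECT key, SUM(value) ... GROUP BY key.

    Because the MV is an aggregation over a single table, each load's
    rows can be merged into the MV table directly -- the hook that
    keeps it "always in sync" with the main table.
    """

    def __init__(self):
        self.mv = defaultdict(int)

    def on_load(self, rows):
        # Triggered automatically after each load of the main table.
        for key, value in rows:
            self.mv[key] += value

mv = SingleTableAggMV()
mv.on_load([("a", 1), ("b", 2)])   # first load
mv.on_load([("a", 3)])             # second load, merged incrementally
print(dict(mv.mv))                 # {'a': 4, 'b': 2}
```

A join MV has no such cheap per-load merge in general, which is why, as stated above, it needs a manual rebuild.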
In reply to this post by Jacky Li
Glad to see you making this proposal! The features you mentioned are really
not popular; even heavy users neither try them nor know their usage.

For 1/2/3/4/5.1/5.2/7, we can remove these features along with their code. But if we consider compatibility, the query processing will still be complex. How can we solve this problem?

For 6, we may need to optimize it. If the problem lies in reading indices into memory, we can find another way to fix it, such as making slices or some other approach.

As for the refactoring points, I'm not sure about the 1st and 2nd points. As I know, in data loading we group dimensions and measures while writing sort_temp_files; this can enhance loading performance since it reduces the file size. |
In reply to this post by Jacky Li
Hi,
Thank you for proposing. Please check my comments below.

1. Global dictionary: It was one of the prime features when it was initially released to Apache. Even though Spark has introduced Tungsten, it still has benefits for compression, filtering and aggregation queries. After the introduction of the local dictionary, compression and filtering got solved partially (it cannot get the same performance as a global dictionary). The major drawback is the data load performance. In some cases, like a MOLAP cube (build once), it still might be useful. Vote: 0

2. Bucket: It is a very useful feature if we use it. If we are planning to remove it, better find an alternative to this feature first. Since this feature is available in spark+parquet, it would be helpful for users who want to migrate to carbon. As I know, this feature was never productized and is still experimental. So if we are planning to keep it, better productize it. Vote: -1

3. Carbon custom partition: Vote: +1

4. Batch sort: Vote: +1

5.1 Page level inverted index: It makes the store size bigger to store these indexes. It is really helpful in the case of multiple IN filters, but that benefit gets overshadowed by the IO and CPU cost due to its size. Vote: +1

5.2 Old preaggregate and time series datamap implementation: Vote: +1 (remove pre-aggregate)

6. Lucene DataMap: It is a helpful feature, but I guess it had performance issues due to bad integration. It would be better if we could fix these issues instead of removing it. Moreover, it is a separate module, so there would not be any code maintenance problem. Vote: -1

7. STORED BY: Vote: +1

Refactoring points:
1 & 2: I think at this point of time it would be a massive refactoring with very little outcome. So better not do it. Vote: -1
3 & 4: Vote: +1

Regards,
Ravindra. |
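[Editorial note: the inverted-index trade-off discussed in point 5.1 can be made concrete with a small sketch. This is hypothetical Python, not the carbon encoder: building a page-level inverted index costs an extra pass and extra storage at load time, but an IN filter then becomes a union of posting lists instead of a scan of every row.]

```python
from collections import defaultdict

page = ["red", "blue", "red", "green", "blue", "red"]

# Load-time cost: one extra pass (and extra storage) to build the
# value -> row-positions map for the page.
inverted = defaultdict(list)
for pos, value in enumerate(page):
    inverted[value].append(pos)

# Query time: IN ('red', 'green') is answered by unioning posting
# lists -- no per-row comparisons.
wanted = {"red", "green"}
hits = sorted(p for v in wanted for p in inverted.get(v, []))

# Without the index, the same filter scans the whole page:
scan_hits = [p for p, v in enumerate(page) if v in wanted]
assert hits == scan_hits
print(hits)
```

The per-page index size grows with cardinality, which is the store-size and IO concern raised above.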
In reply to this post by Jacky Li
Hi:
Thank you for proposing. My votes are below:

1, 3, 4, 5.1, 5.2, 7: +1
2: 0
6: -1, but it should be optimized.

On the internal refactoring points:
1. Unify dimension and measure: +1
2. Keep the column order the same as the schema order: 0
3. Spark integration refactoring based on the Spark extension interface: +1
4. Store optimization (PR2729): +1

In my opinion, we can also do some other refactoring:
1. There are many places using String[] to store data in the process of loading data; this can be replaced with an InternalRow object to save memory.
2. Remove the 'streaming' property and eliminate the difference between streaming and batch tables, so users can insert data into a table in both batch and streaming ways. |
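[Editorial note: the String[]-vs-InternalRow point above rests on a memory argument that a rough Python analogue can illustrate -- parsing fields into typed values once, instead of carrying every field as a string through the loading pipeline, shrinks the per-row footprint. The exact byte counts are CPython-specific; only the relative comparison matters.]

```python
import sys

raw_row = ["20191203", "9999999999", "3.14"]   # every field kept as text
typed_row = (20191203, 9999999999, 3.14)       # parsed once into native types

raw_size = sum(sys.getsizeof(f) for f in raw_row)
typed_size = sum(sys.getsizeof(f) for f in typed_row)

# Typed numeric values are smaller than their string forms here, and
# they also skip re-parsing at every later step (sort, write).
assert typed_size < raw_size
print(raw_size, typed_size)
```

The same reasoning applies on the JVM, where an InternalRow-style binary layout avoids both per-field String objects and repeated parsing.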
In reply to this post by Jacky Li
1. Global dictionary: The value might be visible on small clusters, but the local
dictionary is definitely a scalable solution. It also depends on HDFS features like appending to the previous dictionary file, and there is no method to remove stale data. Also, we don't suggest a global dictionary for high-cardinality dimensions, and for low-cardinality dimensions the cost of duplicating the dictionary across files is not high. So I think we can deprecate this, considering value vs complexity and the use cases it solves. Vote: +1

2. Bucketing: I think this is an important feature, which will considerably improve join performance. So I feel it should not be removed. Vote: -1

3, 4: Vote: +1

5.1 Inverted index: The current inverted index might not be efficient in space, and we don't have a method to detect when an inverted index needs to be built and when it is not required. This area has to be explored further for optimizing various lookups, and refactored. Druid, for example, has an inverted index. Vote: -1

5.2 Old pre-aggregate and time series datamap implementation: Vote: +1

6. Lucene datamap: This needs to be improved rather than deprecated. Vote: -1

7. STORED BY: Vote: +1

Refactoring:
1. Good to do, but need to consider the effort. Vote: 0
2. The column order need not follow the schema order, as columns and their order can logically change per file. Vote: -1
3, 4: Required. Vote: +1

Regards,
Ramana

On Tue, Dec 3, 2019 at 9:53 PM 恩爸 <[hidden email]> wrote: |
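[Editorial note: a minimal sketch of the bucketing-for-joins argument made in the thread -- hypothetical Python, not carbon's implementation. When both tables are bucketed by the join key with the same bucket function and count, rows with equal keys are guaranteed to land in the same bucket index, so each bucket pair can be joined independently without shuffling by key.]

```python
NUM_BUCKETS = 4

def write_bucketed(rows):
    """Partition (key, payload) rows into buckets by join key at load time."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, payload in rows:
        buckets[key % NUM_BUCKETS].append((key, payload))
    return buckets

left = write_bucketed([(1, "l1"), (2, "l2"), (5, "l5")])
right = write_bucketed([(1, "r1"), (5, "r5"), (6, "r6")])

# Bucket-wise join: because both sides used the same bucket function,
# matching keys always share a bucket index, so each bucket pair is
# joined independently (no shuffle by key).
joined = []
for b in range(NUM_BUCKETS):
    rvals = {k: p for k, p in right[b]}
    for k, p in left[b]:
        if k in rvals:
            joined.append((k, p, rvals[k]))

print(sorted(joined))   # [(1, 'l1', 'r1'), (5, 'l5', 'r5')]
```

Whether this pays off in practice is exactly what the +1/-1 votes above disagree about.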
Hi,
Thanks for all your input. The voting summary is as below:

1. Global dictionary: no -1
2. Bucket: two -1
3. Carbon custom partition: no -1
4. BATCH_SORT: no -1
5.1 Page level inverted index: one -1
5.2 Old preaggregate and time series datamap implementation: no -1
6. Lucene datamap: five -1
7. STORED BY: no -1

So, I have created an umbrella JIRA (CARBONDATA-3603) for these items. Please feel free to respond if anyone is interested in working on them.

Regards,
Jacky |
Please find my comment inline
Bucketing: +1
Carbon custom partition: +1
BATCH_SORT: +1
Old preaggregate and time series datamap implementation: +1
STORED BY: +1

Global dictionary: Data loading with a global dictionary is slow, but aggregation, filtering and compression are better than with any other type, whether storing raw values or using a local dictionary. So it might be a useful feature. Vote: 0

Page level inverted index: -1. If the user knows the column on which he/she is going to use an IN filter, it is very useful.

Lucene datamap: Performance is bad because of some code/design issues which can be fixed. -1

And on the internal refactoring points:
1. Unify dimension and measure: It may improve IO performance, but the effort is high. 0
3. Spark integration refactoring based on the Spark extension interface: +1
4. Store optimization (PR2729): +1

-Regards
Kumar Vishal
In reply to this post by Jacky Li
Hi,
1. Global dictionary: 0
2. Bucket and 3. Custom partition: +1
4. Batch sort: +1
5.1 Page level inverted index: 0
5.2 Preaggregate and old time series datamap: +1
7. STORED BY: +1
Store optimization: +1

I also suggest the refactoring below: [DISCUSSION] Segment file improvement for the update and delete case. You can find the problem statement in the discussion thread below:
https://lists.apache.org/list.html?dev@...:2019-09

Regards,
Akash

On 2019/11/28 16:32:57, Jacky Li <[hidden email]> wrote: |