Apache CarbonData Dev Mailing List archive

Propose feature change in CarbonData 2.0

Posted by Jacky Li on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Propose-feature-change-in-CarbonData-2-0-tp87540.html

Hi Community,

As we are moving to CarbonData 2.0, in order to keep the project moving
forward quickly and stably, it is necessary to do some refactoring and clean
up obsolete features before introducing new features.

To do that, I propose making the following features obsolete and unsupported
starting with 2.0. In my opinion, these features are seldom used.

1. Global dictionary
Since Spark 2.x, aggregation is much faster thanks to Project Tungsten, so
Global Dictionary is no longer very useful, while it makes data loading slow
and needs very complex SQL plan transformation.

2. Bucket
The bucket feature of Carbon is intended to improve join performance, but the
actual improvement is very limited.

3. Carbon custom partition
Now that we have Hive standard partition, the old custom partition is not
very useful.

4. BATCH_SORT
I have not seen anyone use this feature

5. Page level inverse index
This is arguable. I understand that in a very specific scenario (when there
are many columns in an IN filter) it has a benefit, but it slows down data
loading and makes the encoding code very complex.

6. Old preaggregate and time series datamap implementation
As we have introduced MV, these two features can be dropped. And we can
follow standard SQL to have a new syntax to create an MV: CREATE
MATERIALIZED VIEW

7. Lucene datamap
This feature is not well implemented, as it reads too much index data into
memory, thus creating memory problems in most cases.

8. STORED BY
We should follow either Hive syntax (STORED AS) or SparkSQL syntax (USING).

And there is some internal refactoring we can do:
1. Unify dimension and measure

2. Keep the column order the same as schema order

3. Spark integration refactoring based on the Spark extension interface

4. Store optimization PR2729

The aim of this proposal is to make the CarbonData code cleaner and reduce
the community's maintenance effort.
What do you think of it?

Regards,
Jacky
