Apache CarbonData Dev Mailing List archive

Re: Propose feature change in CarbonData 2.0

Posted by akashnilugal@gmail.com on Dec 06, 2019; 12:37pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Propose-feature-change-in-CarbonData-2-0-tp87540p87876.html

Hi,

1. Global Dict - 0
2. Bucket and 3. Custom Partition +1
4. Batch sort +1
5. Page level 0
6. Preaggregate and old timeseries +1
7. stored by +1

Store optimization +1

I also suggest the following refactoring:
[DISCUSSION] Segment file improvement for Update and delete case.
You can find the problem statement in the discussion thread below:
https://lists.apache.org/list.html?dev@...:2019-09

Regards,
Akash

On 2019/11/28 16:32:57, Jacky Li <[hidden email]> wrote:

>
> Hi Community,
>
> As we are moving to CarbonData 2.0, in order to keep the project moving
> forward fast and stable, it is necessary to do some refactoring and clean up
> obsolete features before introducing new features.
>
> To do that, I propose making the following features obsolete and unsupported
> as of 2.0. In my opinion, these features are seldom used.
>
> 1. Global dictionary
> Since Spark 2.x, aggregation is much faster thanks to Project Tungsten, so
> Global Dictionary is not very useful, yet it makes data loading slow and
> needs very complex SQL plan transformation.
>
> 2. Bucket
> The Bucket feature of Carbon is intended to improve join performance, but
> the actual improvement is very limited.
>
> 3. Carbon custom partition
> Since we now have Hive standard partition, the old custom partition is not
> very useful.
>
> 4. BATCH_SORT
> I have not seen anyone use this feature
>
> 5. Page level inverse index
> This is arguable. I understand that in a very specific scenario (when there
> are many columns in an IN filter) it has a benefit, but it slows down data
> loading and makes the encoding code very complex.
>
> 5. Old preaggregate and time series datamap implementation
> As we have introduced MV, these two features can be dropped. And we can
> follow standard SQL to have a new syntax to create MV: CREATE
> MATERIALIZED VIEW
>
> 6. Lucene datamap
> This feature is not well implemented, as it reads too much index into
> memory, thus creating memory problems in most cases.
>
> 7. STORED BY
> We should follow either Hive syntax (STORED AS) or SparkSQL syntax (USING).
>
>
> And there is some internal refactoring we can do:
> 1. Unify dimension and measure
>
> 2. Keep the column order the same as schema order
>
> 3. Spark integration refactoring based on Spark extension interface
>
> 4. Store optimization PR2729
>
>
> The aim of this proposal is to make the CarbonData code cleaner and reduce
> the community's maintenance effort.
> What do you think of it?
>
>
> Regards,
> Jacky