Apache CarbonData Dev Mailing List archive

Re: Propose feature change in CarbonData 2.0

Posted by Venkata Gollamudi on Dec 04, 2019; 6:56am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Propose-feature-change-in-CarbonData-2-0-tp87540p87734.html

1. Global dictionary: It might add value on small clusters, but local
dictionary is definitely the more scalable solution. Global dictionary also
depends on HDFS features like appending to the previous dictionary file, and
there is no method to remove state data. Also, we don't suggest global
dictionary for high cardinality dimensions, and for low cardinality
dimensions the cost of duplicating the dictionary across files is not high.
So I think it is better to deprecate this, considering its value vs
complexity and the use cases it solves. vote: +1
2. Bucketing: I think this is an important feature, which will considerably
improve join performance. So I feel it should not be removed. vote: -1
3, 4 vote: +1
5. Inverted index: The current inverted index might not be space-efficient,
and we don't have a method to detect when an inverted index needs to be
built and when it is not required. This area has to be explored further and
refactored to optimise various lookups. For example, Druid has an inverted
index. vote: -1
5. Old pre-aggregate and time series datamap implementation: vote: +1
6. Lucene datamap: This needs to be improved rather than deprecated.
vote: -1
7. Stored by: vote: +1
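
To illustrate the trade-off in point 1, here is a minimal sketch in plain
Python. It is not CarbonData code: the function name, the sample values and
the three "files" are all hypothetical. It only shows that, for a low
cardinality column, per-file (local) dictionaries duplicate a handful of
entries, while a global dictionary needs shared state that every later load
has to extend.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Assign each distinct value an integer code in first-seen order."""
    dictionary: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        # len(dictionary) is evaluated before a new key is inserted,
        # so a new value receives the next free code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


# Hypothetical low-cardinality column, split across three data files.
files = [
    ["IN", "US", "IN", "CN"],
    ["US", "US", "CN", "IN"],
    ["CN", "IN", "DE", "US"],
]

# Local dictionary: every file carries its own dictionary, no shared state.
local = [dictionary_encode(rows) for rows in files]
local_entries = sum(len(dictionary) for dictionary, _ in local)

# Global dictionary: a single dictionary shared by all files (assumption:
# built here in one pass; in practice it would be appended to on each load).
global_dictionary, _ = dictionary_encode([v for rows in files for v in rows])
global_entries = len(global_dictionary)

# Extra dictionary entries paid for by not sharing state across files.
duplicated_entries = local_entries - global_entries
print(local_entries, global_entries, duplicated_entries)
```

The duplication grows with the number of files times the cardinality, which
is why it stays small for low cardinality dimensions.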

Refactoring:
1. Good to do, but we need to consider the effort. vote: 0
2. Column order need not follow schema order, as columns and their order
can logically change from file to file. vote: -1
3, 4 are required. vote: +1

Regards,
Ramana

On Tue, Dec 3, 2019 at 9:53 PM 恩爸 <[hidden email]> wrote:

> Hi:
>   Thank you for proposing. My votes are below:
>
>
>   1,3,4,5.1,5.2,7:  +1
>   2:  0
>   6:  -1, but should be optimized.
>
>
>   And there is some internal refactoring we can do:
>   1. Unify dimension and measure   +1.
>
>   2. Keep the column order the same as schema order   0.
>
>   3. Spark integration refactoring based on Spark extension interface   +1
>
>   4. Store optimization PR2729   +1
>
>   In my opinion, we can also do some other refactoring:
>   1. There are many places that use string[] to store data during data
> loading; these could be replaced with InternalRow objects to save memory.
>   2. Remove the 'streaming' property and eliminate the difference between
> streaming and batch tables, so that users can insert data into a table in
> either batch or streaming mode.
>
> ------------------ Original ------------------
> From: "ravipesala [via Apache CarbonData Dev Mailing List archive]"<
> [hidden email]>;
> Date: Tue, Dec 3, 2019 06:07 PM
> To: "恩爸"<[hidden email]>;
>
> Subject: Re: Propose feature change in CarbonData 2.0
>
>
>
> Hi,
>
> Thank you for proposing. Please check my comments below.
>
> 1. Global dictionary: It was one of the prime features when CarbonData was
> initially released to Apache. Even though Spark has introduced Tungsten, it
> still has its benefits for compression, filtering and aggregation queries.
> After the introduction of the local dictionary, this was partially solved
> for compression and filtering (though it cannot reach the same performance
> as a global dictionary). The only major drawback here is the data load
> performance. In some cases, like a MOLAP cube (build once), it might still
> be useful.
> Vote: 0
>
> 2. Bucket: It is a very useful feature if we use it. If we are planning to
> remove it, we should find an alternative to this feature first. Since this
> feature is available in spark+parquet, it would be helpful for users who
> want to migrate to carbon. As far as I know, this feature was never
> productized and is still experimental. So if we are planning to keep it,
> we should productize it. Vote : -1
>
> 3. Carbon custom partition: Vote : +1
>
> 4. Batch Sort : Vote : +1
>
> 5. Page level inverse index : Storing these indexes makes the store size
> bigger. It is really helpful in the case of multiple IN filters, but the
> benefit is overshadowed by the IO and CPU cost due to its size. Vote : +1
>
> 5. Old preaggregate and time series datamap implementation : Vote : +1
> (remove pre-aggregate)
>
> 6. Lucene DataMap: It is a helpful feature, but I guess it had performance
> issues due to bad integration. It would be better if we can fix these
> issues instead of removing it. Moreover, it is a separate module, so there
> would not be any code maintenance problem. Vote : -1
>
> 7. STORED BY : Vote : +1
>
> Refactoring points:
> 1 & 2 : I think at this point in time it would be a massive refactoring
> with very little outcome. So better not to do it. Vote : -1
>
> 3 & 4 : Vote : +1
>
>
>
> Regards,
> Ravindra.
>
>
>