Hi Community,

As we are moving to CarbonData 2.0, in order to keep the project moving forward fast and stable, it is necessary to do some refactoring and clean up obsolete features before introducing new ones.

To that end, I propose making the following features obsolete and unsupported since 2.0. In my opinion, these features are seldom used.

1. Global dictionary
Since Spark 2.x, aggregation is much faster thanks to project Tungsten, so the global dictionary is not very useful any more, yet it makes data loading slow and needs very complex SQL plan transformation.

2. Bucket
The bucket feature of carbon is intended to improve join performance, but the actual improvement is very limited.

3. Carbon custom partition
Now that we have Hive standard partitioning, the old custom partition is not very useful.

4. BATCH_SORT
I have not seen anyone use this feature.

5.1 Page level inverted index
This is arguable. I understand that in one very specific scenario (when there are many columns in an IN filter) it has a benefit, but it slows down data loading and makes the encoding code very complex.

5.2 Old preaggregate and time series datamap implementation
As we have introduced MV, these two features can be dropped. And we can follow standard SQL with a new syntax to create an MV: CREATE MATERIALIZED VIEW.

6. Lucene datamap
This feature is not well implemented, as it reads too much index into memory, creating memory problems in most cases.

7. STORED BY
We should follow either the Hive syntax (STORED AS) or the SparkSQL syntax (USING).

And there is some internal refactoring we can do:
1. Unify dimension and measure
2. Keep the column order the same as the schema order
3. Spark integration refactoring based on the Spark extension interface
4. Store optimization (PR2729)

The aim of this proposal is to make the CarbonData code cleaner and reduce the community's maintenance effort. What do you think of it?

Regards,
Jacky

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ |
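[Editorial note: to make the dictionary trade-off in point 1 concrete, here is a toy Python sketch. It is not CarbonData code; the names are made up. It shows why a global dictionary helps aggregation (codes are consistent across all files, so grouping can run on integers) but complicates loading (every loader must agree on one shared dictionary).]

```python
# Toy contrast between a global dictionary (one code table shared by all
# files) and local dictionaries (one code table per file).

def encode(values, dictionary):
    """Dictionary-encode values, growing the dictionary as needed."""
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # loading cost: shared lookup/update
        codes.append(dictionary[v])
    return codes

file1 = ["china", "india", "china"]
file2 = ["india", "india", "usa"]

# Global dictionary: both files share one mapping, so equal values get
# equal codes everywhere and aggregation can group directly on integers.
global_dict = {}
g1 = encode(file1, global_dict)
g2 = encode(file2, global_dict)
assert g1[1] == g2[0]  # "india" has the same code in both files

# Local dictionaries: each file builds its own mapping; loading is
# independent, but codes are not comparable across files, so a
# cross-file aggregation must decode back to values first.
local1, local2 = {}, {}
l1 = encode(file1, local1)
l2 = encode(file2, local2)
print(g1, g2, l1, l2)
```

The shared-dictionary update in the loop is the part that makes loading slow in a distributed setting, which is the drawback the proposal cites.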
+1
-----
Best Regards
David Cai |
In reply to this post by Jacky Li
In my opinion, carbon 2.0 is the right time to clean up some unused features
to make the code cleaner and reduce maintenance effort.

+1 agree, -1 disagree, 0 other.
+1: 1, 2, 3, 5.1, 5.2, 7
0: 4
-1: 6, but it should be optimized. |
Hi,
Considering 5.2: in carbon 2.0, will MVs be "always in sync" like the carbon 1 pre-aggregate datamap, or will they require an action to be put back online after each update? Vertica, ClickHouse, Vector and some other first-class OLAP engines offer "always in sync" pre-aggregate views, which are very convenient.

Thanks,
Benoit

> On 29 Nov 2019, at 13:19, xubo245 <[hidden email]> wrote: |
Hi Benoit,
Thanks for pointing this out. Yes, it will behave like the carbon 1 preaggregate datamap. The MV implementation in CarbonData will check whether the MV is an aggregation on a single table; if so, it will be "always in sync" (a load of the MV table is automatically triggered after loading the main table). But if the MV involves a multiple-table join, it needs to be rebuilt manually.

Regards,
Jacky |
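[Editorial note: the "always in sync" behaviour described above for single-table aggregations can be sketched conceptually in Python. This is not the actual CarbonData MV code, just the idea: each load of the main table triggers an incremental merge into the aggregate table, so the MV never needs a manual rebuild.]

```python
from collections import defaultdict

class SingleTableAggMV:
    """Toy materialized view: SELECT key, SUM(value) ... GROUP BY key.

    Because the MV is an aggregation over a single table, each load's
    rows can be merged into the MV table directly -- the hook that
    keeps it "always in sync" with the main table.
    """

    def __init__(self):
        self.mv = defaultdict(int)

    def on_load(self, rows):
        # Triggered automatically after each load of the main table.
        for key, value in rows:
            self.mv[key] += value

mv = SingleTableAggMV()
mv.on_load([("a", 1), ("b", 2)])   # first load
mv.on_load([("a", 3)])             # second load, merged incrementally
print(dict(mv.mv))                 # {'a': 4, 'b': 2}
```

A join MV has no such cheap per-load merge in general, which is why, as stated above, it needs a manual rebuild.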
In reply to this post by Jacky Li
Glad to see you making this proposal! The features you mentioned are really
not popular; even heavy users neither try them nor know their usage.

For 1/2/3/4/5.1/5.2/7, we can remove these features along with their code. But if we consider compatibility, the query processing will still be complex. How can we solve this problem?

For 6, we may need to optimize it. If the problem lies in reading indices into memory, we can find another way to fix it, such as making slices or some other approach.

As for the refactoring points, I'm not sure about the 1st and 2nd points. As I know, in data loading we group dimensions and measures while writing sort_temp_files; this can enhance loading performance since it reduces the file size. |
In reply to this post by Jacky Li
Hi,
Thank you for proposing. Please check my comments below.

1. Global dictionary: It was one of the prime features when it was initially released to Apache. Even though Spark has introduced Tungsten, it still has benefits for compression, filtering and aggregation queries. After the introduction of the local dictionary, compression and filtering got solved partially (it cannot get the same performance as a global dictionary). The major drawback is the data load performance. In some cases, like a MOLAP cube (build once), it still might be useful. Vote: 0

2. Bucket: It is a very useful feature if we use it. If we are planning to remove it, better find an alternative to this feature first. Since this feature is available in spark+parquet, it would be helpful for users who want to migrate to carbon. As I know, this feature was never productized and is still experimental. So if we are planning to keep it, better productize it. Vote: -1

3. Carbon custom partition: Vote: +1

4. Batch sort: Vote: +1

5.1 Page level inverted index: It makes the store size bigger to store these indexes. It is really helpful in the case of multiple IN filters, but that benefit gets overshadowed by the IO and CPU cost due to its size. Vote: +1

5.2 Old preaggregate and time series datamap implementation: Vote: +1 (remove pre-aggregate)

6. Lucene DataMap: It is a helpful feature, but I guess it had performance issues due to bad integration. It would be better if we could fix these issues instead of removing it. Moreover, it is a separate module, so there would not be any code maintenance problem. Vote: -1

7. STORED BY: Vote: +1

Refactoring points:
1 & 2: I think at this point of time it would be a massive refactoring with very little outcome. So better not do it. Vote: -1
3 & 4: Vote: +1

Regards,
Ravindra. |
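[Editorial note: the inverted-index trade-off discussed in point 5.1 can be made concrete with a small sketch. This is hypothetical Python, not the carbon encoder: building a page-level inverted index costs an extra pass and extra storage at load time, but an IN filter then becomes a union of posting lists instead of a scan of every row.]

```python
from collections import defaultdict

page = ["red", "blue", "red", "green", "blue", "red"]

# Load-time cost: one extra pass (and extra storage) to build the
# value -> row-positions map for the page.
inverted = defaultdict(list)
for pos, value in enumerate(page):
    inverted[value].append(pos)

# Query time: IN ('red', 'green') is answered by unioning posting
# lists -- no per-row comparisons.
wanted = {"red", "green"}
hits = sorted(p for v in wanted for p in inverted.get(v, []))

# Without the index, the same filter scans the whole page:
scan_hits = [p for p, v in enumerate(page) if v in wanted]
assert hits == scan_hits
print(hits)
```

The per-page index size grows with cardinality, which is the store-size and IO concern raised above.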
In reply to this post by Jacky Li
Hi:
Thank you for proposing. My votes are below:

1, 3, 4, 5.1, 5.2, 7: +1
2: 0
6: -1, but it should be optimized.

On the internal refactoring points:
1. Unify dimension and measure: +1
2. Keep the column order the same as the schema order: 0
3. Spark integration refactoring based on the Spark extension interface: +1
4. Store optimization (PR2729): +1

In my opinion, we can also do some other refactoring:
1. There are many places using String[] to store data in the process of loading data; this can be replaced with an InternalRow object to save memory.
2. Remove the 'streaming' property and eliminate the difference between streaming and batch tables, so users can insert data into a table in both batch and streaming ways. |
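[Editorial note: the String[]-vs-InternalRow point above rests on a memory argument that a rough Python analogue can illustrate -- parsing fields into typed values once, instead of carrying every field as a string through the loading pipeline, shrinks the per-row footprint. The exact byte counts are CPython-specific; only the relative comparison matters.]

```python
import sys

raw_row = ["20191203", "9999999999", "3.14"]   # every field kept as text
typed_row = (20191203, 9999999999, 3.14)       # parsed once into native types

raw_size = sum(sys.getsizeof(f) for f in raw_row)
typed_size = sum(sys.getsizeof(f) for f in typed_row)

# Typed numeric values are smaller than their string forms here, and
# they also skip re-parsing at every later step (sort, write).
assert typed_size < raw_size
print(raw_size, typed_size)
```

The same reasoning applies on the JVM, where an InternalRow-style binary layout avoids both per-field String objects and repeated parsing.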
In reply to this post by Jacky Li
1. Global dictionary: The value might be visible on small clusters, but the local
dictionary is definitely a scalable solution. It also depends on HDFS features like appending to the previous dictionary file, and there is no method to remove stale data. Also, we don't suggest a global dictionary for high-cardinality dimensions, and for low-cardinality dimensions the cost of duplicating the dictionary across files is not high. So I think we can deprecate this, considering value vs complexity and the use cases it solves. Vote: +1

2. Bucketing: I think this is an important feature, which will considerably improve join performance. So I feel it should not be removed. Vote: -1

3, 4: Vote: +1

5.1 Inverted index: The current inverted index might not be efficient in space, and we don't have a method to detect when an inverted index needs to be built and when it is not required. This area has to be explored further for optimizing various lookups, and refactored. Druid, for example, has an inverted index. Vote: -1

5.2 Old pre-aggregate and time series datamap implementation: Vote: +1

6. Lucene datamap: This needs to be improved rather than deprecated. Vote: -1

7. STORED BY: Vote: +1

Refactoring:
1. Good to do, but need to consider the effort. Vote: 0
2. The column order need not follow the schema order, as columns and their order can logically change per file. Vote: -1
3, 4: Required. Vote: +1

Regards,
Ramana

On Tue, Dec 3, 2019 at 9:53 PM 恩爸 <[hidden email]> wrote: |
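[Editorial note: a minimal sketch of the bucketing-for-joins argument made in the thread -- hypothetical Python, not carbon's implementation. When both tables are bucketed by the join key with the same bucket function and count, rows with equal keys are guaranteed to land in the same bucket index, so each bucket pair can be joined independently without shuffling by key.]

```python
NUM_BUCKETS = 4

def write_bucketed(rows):
    """Partition (key, payload) rows into buckets by join key at load time."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, payload in rows:
        buckets[key % NUM_BUCKETS].append((key, payload))
    return buckets

left = write_bucketed([(1, "l1"), (2, "l2"), (5, "l5")])
right = write_bucketed([(1, "r1"), (5, "r5"), (6, "r6")])

# Bucket-wise join: because both sides used the same bucket function,
# matching keys always share a bucket index, so each bucket pair is
# joined independently (no shuffle by key).
joined = []
for b in range(NUM_BUCKETS):
    rvals = {k: p for k, p in right[b]}
    for k, p in left[b]:
        if k in rvals:
            joined.append((k, p, rvals[k]))

print(sorted(joined))   # [(1, 'l1', 'r1'), (5, 'l5', 'r5')]
```

Whether this pays off in practice is exactly what the +1/-1 votes above disagree about.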
Hi,
Thanks for all your input. The voting summary is as below:

1. Global dictionary: no -1
2. Bucket: two -1
3. Carbon custom partition: no -1
4. BATCH_SORT: no -1
5.1 Page level inverted index: one -1
5.2 Old preaggregate and time series datamap implementation: no -1
6. Lucene datamap: five -1
7. STORED BY: no -1

So, I have created an umbrella JIRA (CARBONDATA-3603) for these items. Please feel free to respond if anyone is interested in working on them.

Regards,
Jacky |
Please find my comment inline
Bucketing: +1
Carbon custom partition: +1
BATCH_SORT: +1
Old preaggregate and time series datamap implementation: +1
STORED BY: +1

Global dictionary: Data loading with a global dictionary is slow, but aggregation, filtering and compression are better than with any other type, whether storing raw values or using a local dictionary. So it might be a useful feature. Vote: 0

Page level inverted index: -1. If the user knows the column on which he/she is going to use an IN filter, it is very useful.

Lucene datamap: Performance is bad because of some code/design issues which can be fixed. -1

And on the internal refactoring points:
1. Unify dimension and measure: It may improve IO performance, but the effort is high. 0
3. Spark integration refactoring based on the Spark extension interface: +1
4. Store optimization (PR2729): +1

-Regards
Kumar Vishal
In reply to this post by Jacky Li
Hi,
1. Global dictionary: 0
2. Bucket and 3. Custom partition: +1
4. Batch sort: +1
5.1 Page level inverted index: 0
5.2 Preaggregate and old time series datamap: +1
7. STORED BY: +1
Store optimization: +1

I also suggest the refactoring below: [DISCUSSION] Segment file improvement for the update and delete case. You can find the problem statement in the discussion thread below:
https://lists.apache.org/list.html?dev@...:2019-09

Regards,
Akash

On 2019/11/28 16:32:57, Jacky Li <[hidden email]> wrote: |