Hi all,
Currently in CarbonData, 'local_sort' is the default sort_scope, and by default all dimension columns are selected as sort_columns. This slows down data loading.

To give users the best performance with default values, we can change the default sort_scope to 'no_sort' and stop using all dimensions as sort_columns by default.

Also, if sort_columns is specified but sort_scope is not, sort_scope should implicitly be treated as 'local_sort'.

These defaults apply to CarbonSession, the Spark file format and the SDK (all will have the same behavior).

With these changes, the performance results of TPCH queries on 500GB of data are:

* Load time is improved by nearly 4 times.
* Total query time across all queries is improved. (50% of queries are faster with no_sort; the other 50% are slightly degraded or the same. Overall performance is better.)

While making this change, I found a few major issues in the existing code in the 'no_sort' and empty sort_columns flow, and I have fixed those as well. The issues found are:

[CARBONDATA-3162] Range filters don't remove null values for no_sort direct dictionary dimension columns.
[CARBONDATA-3163] If tables have different time formats, no_sort column data goes as bad record (null) for the second table when it is loaded after the first table.
[CARBONDATA-3164] During no_sort, an exception raised at the converter step does not reach the user. The same problem exists in the SDK and Spark file format flows.
Also fixed multiple test case issues.

I have already opened a PR for fixing these issues.
https://github.com/apache/carbondata/pull/2966

Let me know if you have any suggestions about these changes.

Thanks,
Ajantha
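For readers following the thread, the proposed defaults can be sketched roughly as below. This is only an illustration: `resolve_sort_defaults` is a hypothetical helper, not CarbonData code or API. It assumes sort_columns arrives as a comma-separated string, and it assumes that empty sort_columns always means no_sort, even when a sort_scope is given (a later reply in this thread says that empty sort_columns already results in no_sort).

```python
from typing import List, Optional, Tuple


def resolve_sort_defaults(
    sort_columns: Optional[str], sort_scope: Optional[str]
) -> Tuple[List[str], str]:
    """Hypothetical sketch of the proposed defaults; not CarbonData API.

    sort_columns: comma-separated column names, or None/"" if not specified.
    sort_scope:   scope given by the user, or None if not specified.
    """
    # Split the property into individual column names, ignoring blanks.
    columns = [c.strip() for c in (sort_columns or "").split(",") if c.strip()]

    if not columns:
        # Assumption: nothing to sort on, so the scope is no_sort.
        scope = "no_sort"
    elif sort_scope is not None:
        # The user chose both the columns and the scope; keep the choice.
        scope = sort_scope.strip().lower()
    else:
        # sort_columns given without sort_scope: implicitly local_sort.
        scope = "local_sort"
    return columns, scope
```

With this sketch, a table created without either property resolves to no sort columns and 'no_sort', while a table created with only sort_columns resolves to 'local_sort'.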
Hi
First, let me check that I understand your proposal. You mean:

1. If the user defines "sort_columns=columns": all behavior stays the same as today, with no change. (Most users set this option when creating a CarbonData table.)
2. If the user doesn't define "sort_columns": the current default behavior is that all dimension columns are selected as sort_columns and sort_scope is local_sort. You propose to change this default behavior to use no_sort, right?

If yes, I agree with this proposal, and I propose removing the "empty sort_column" option. It would be easier for users to understand: if sort_column is defined, use local_sort; if sort_column is not defined, use no_sort.

Regards
Liang

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
In reply to this post by Ajantha Bhat
I think we can just rephrase the proposal.

We want `sort_columns` to be empty by default; that is, if the user does not explicitly specify sort_columns, the corresponding property will be 'sort_columns'=''. And when sort_columns is empty, CarbonData uses no_sort for it. This strategy is already there.

So if my understanding is correct, please use the above statements to make the PR clearer and easier to understand.
In reply to this post by Ajantha Bhat
It would be better to also support altering 'sort_columns' and 'sort_scope'.
After table creation and data loading, the user could then adjust 'sort_columns' and 'sort_scope'.

Best Regards
David Cai
@Liang: Yes, your understanding of my proposal is correct.
Why remove empty sort_columns? If the user specifies empty sort columns, should I throw an exception saying the specified sort_columns are not present? I feel there is no need to remove empty sort columns; by default we set sort_columns to empty internally.

@xuchuanyin: Yes, that's all. But I also want to change CarbonCommonConstants.LOAD_SORT_SCOPE_DEFAULT, because in some places sort_scope is displayed or addressed without referring to sort_columns. There I want to show the default as NO_SORT.

@david: I will check this use case and the development scope for this version. If required, I will do it in a separate PR.

Thanks,
Ajantha

On Mon, Dec 17, 2018 at 7:17 AM David CaiQiang <[hidden email]> wrote:
> Better to support alter 'sort_columns' and 'sort_scope' also.
>
> After the table creation and data loading, the user can adjust
> 'sort_columns' and 'sort_scope'.
Hi Ajantha
+1 for the proposal.

1. I agree with Liang on removing the empty SORT_COLUMNS option. This will give the user more clarity about the property's behavior: if it is configured we use LOCAL_SORT, else we use NO_SORT. Internally you can keep whatever behavior the implementation needs; it need not be exposed to the user.
2. Regarding David's suggestion, I feel that supporting the altering of SORT_COLUMNS is a very high-level statement. It will also impact the compaction operation, where we would have to decide which sort_columns to consider. I feel we should not consider altering sort_columns as part of this proposal. We need to think this through and come up with a separate design.

Regards
Manish Gupta
Hi,
+1 for making 'no_sort' the default sort_scope.

1. Regarding removing the empty SORT_COLUMNS option, I don't think we should change the current behaviour, as some users might already be using it in their scripts; if we remove the empty SORT_COLUMNS option, their scripts will start failing after upgrade. It is better to mark it as deprecated. In a future major release, we can remove all deprecated options.
2. Regarding David's suggestion, we cannot change the sort columns, as that impacts compaction. But we can change the sort_scope. So I think it is better to consider only updating sort_scope through the alter command.

Regards,
Ravindra.
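The deprecation route suggested above could look roughly like the sketch below: an explicitly empty SORT_COLUMNS keeps working but produces a warning instead of a failure. This is an illustration only. `parse_sort_columns` and the dictionary of table properties are hypothetical, not CarbonData code, and the warning text is made up for the example.

```python
import warnings
from typing import Dict, List


def parse_sort_columns(table_properties: Dict[str, str]) -> List[str]:
    """Hypothetical sketch: accept empty sort_columns but flag it as deprecated."""
    raw = table_properties.get("sort_columns")
    if raw is None:
        # Property not specified at all: no sort columns, no warning.
        return []

    columns = [c.strip() for c in raw.split(",") if c.strip()]
    if not columns:
        # Existing scripts that pass 'sort_columns'='' keep working,
        # but the user is told the option is on its way out.
        warnings.warn(
            "Empty 'sort_columns' is deprecated; omit the property instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    return columns
```

The point of the design is that an upgrade does not break existing scripts, while the option can still be removed in a later major release.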
I think no_sort is the default only if the user does not specify sort_columns explicitly, not in all scenarios, right?
+1 for keeping 'sort_columns' unchanged, because the fields in sort_columns have a different encoding strategy from the others.

@Ajantha, please summarize the conclusion of these mails before you start work.