Apache CarbonData Dev Mailing List archive

[Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

Ajantha Bhat

Dec 11, 2018; 1:31pm

[Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

Hi all,
Currently in CarbonData, 'local_sort' is the default sort_scope and, by
default, all dimension columns are selected as sort_columns.
This slows down data loading.
To give users the best performance with the default values,
we can change the default sort_scope to 'no_sort' and stop using all
dimensions as sort_columns by default.
Also, if sort_columns is specified but sort_scope is not specified by the
user, we need to implicitly treat sort_scope as 'local_sort'.
These default values apply to CarbonSession, the Spark file format
and the SDK as well (all will have the same behavior).
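
The defaulting rule above can be sketched as follows. This is a minimal
Python sketch, not CarbonData code: the function name
`resolve_sort_settings` and the plain-dict representation of table
properties are assumptions made for illustration; only the property names
and the 'no_sort' / 'local_sort' values come from this thread.

```python
from typing import Dict, Tuple


def resolve_sort_settings(props: Dict[str, str]) -> Tuple[Tuple[str, ...], str]:
    """Return (sort_columns, sort_scope) under the proposed defaults.

    `props` is a hypothetical dict of user-supplied table properties
    with lower-case keys; it is not a CarbonData data structure.
    """
    # Proposed default: no sort columns unless the user names some.
    raw_columns = props.get("sort_columns", "")
    columns = tuple(c.strip() for c in raw_columns.split(",") if c.strip())

    scope = props.get("sort_scope")
    if scope is None:
        # sort_scope not given: 'local_sort' if sort columns were
        # specified, otherwise the new default 'no_sort'.
        scope = "local_sort" if columns else "no_sort"

    # An explicitly supplied scope is returned as given (lower-cased).
    return columns, scope.lower()
```

For example, `resolve_sort_settings({})` yields no sort columns with
'no_sort', while `resolve_sort_settings({"sort_columns": "c1, c2"})`
yields ('c1', 'c2') with 'local_sort'.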

With these changes, below are the performance results of TPCH queries on
500GB of data:

* Load time is improved by nearly 4 times.
* Total query time across all queries is improved (50% of queries are
  faster with no_sort, the other 50% are slightly degraded or the same;
  overall performance is better).

Also, while making this change, I found a few major issues in the existing
'no_sort' and empty sort_columns flow. I have fixed those as well.
Below are the issues found:

* [CARBONDATA-3162] Range filters don't remove null values for no_sort
  direct dictionary dimension columns.
* [CARBONDATA-3163] If tables have different time formats, for no_sort
  columns the data goes as a bad record (null) for the second table when
  it is loaded after the first table.
* [CARBONDATA-3164] During no_sort, an exception that happens at the
  converter step does not reach the user. The same problem exists in the
  SDK and Spark file format flows.

Also fixed multiple test case issues.

I have already opened a PR for fixing these issues.
https://github.com/apache/carbondata/pull/2966

Let me know if you have any suggestions about these changes.

Thanks,
Ajantha

Liang Chen

Dec 15, 2018; 9:51am

Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

Administrator

Hi,

First, let me understand your proposal. You mean:
1. If the user defines "sort_columns=columns": all behaviors are the same
as today, with no change. (Most users will set this key option when
creating a CarbonData table.)
2. If the user doesn't define "sort_columns": the current default behavior
is that all dimension columns are selected as sort_columns and sort_scope
is local_sort. *You propose to change this default behavior to use
no_sort, right?*

If yes, I agree with this proposal, and I propose removing the "empty
sort_column" option. *It would be easier for users to understand: if
sort_column is defined, use local_sort; if sort_column is not defined,
use no_sort.*

Regards
Liang

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

xuchuanyin

Dec 17, 2018; 1:22am

Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

In reply to this post by Ajantha Bhat

I think we can just rephrase the proposal.

We want to make `sort_columns` empty by default; that is to say, if
the user does not explicitly specify sort_columns, the corresponding
property will be 'sort_columns'=''.
And when sort_columns is empty, CarbonData will use no_sort for it.
This strategy is already there.

So if my understanding is correct, please use the above statements to make
the PR clearer and easier to understand.


David CaiQiang

Dec 17, 2018; 1:47am

Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

In reply to this post by Ajantha Bhat

It would be better to also support altering 'sort_columns' and 'sort_scope'.

Then, after table creation and data loading, the user can adjust
'sort_columns' and 'sort_scope'.

-----
Best Regards
David Cai

Ajantha Bhat

Dec 17, 2018; 8:58am

Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

@Liang: Yes, your understanding of my proposal is correct.
Why remove empty sort_columns? If the user specifies empty sort columns,
should I throw an exception saying the specified sort_columns are not
present? I feel there is no need to remove empty sort columns; by default
we set sort_columns to empty internally.

@xuchuanyin: Yes, that's all. But I also want to change
CarbonCommonConstants.LOAD_SORT_SCOPE_DEFAULT, because in some places
sort_scope is displayed or addressed without referring to sort_columns.
There I want the default to be shown as NO_SORT.

@david: I will check this use case and the development scope of this
version. If required, I will do it in a separate PR.

Thanks,
Ajantha

manishgupta88

Dec 17, 2018; 9:11am

Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

Hi Ajantha,

+1 for the proposal.

1. I agree with Liang on removing the empty SORT_COLUMNS option. This will
give the user more clarity about the property's behavior: if it is
configured we use LOCAL_SORT, otherwise we use NO_SORT. Internally you can
keep whatever behavior the implementation needs; it need not be exposed to
the user.
2. Regarding David's suggestion, I feel that supporting alteration of
SORT_COLUMNS is a very high-level statement. It will also impact the
compaction operation, where we will have to decide which sort_columns to
consider. I feel we should not consider altering sort_columns as part of
this proposal. We need to think this through and come up with a separate
design.

Regards
Manish Gupta


ravipesala

Dec 17, 2018; 11:08am

Re: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

Hi,

+1 for making 'no_sort' as default sort_scope

1. Regarding removing the empty SORT_COLUMNS option, I don't think we
should change the current behaviour, as some users might already be using
it in their scripts; if we remove the empty SORT_COLUMNS option, their
scripts will start failing after upgrade. It is better to mark it as
deprecated. In a future major release, we can remove all deprecated
options.

2. Regarding David's suggestion, we cannot change the sort columns as it
impacts compaction. But we can change the sort_scope. So I think it is
better to consider only updating sort_scope through the alter command.
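
The two points above can be sketched as follows. This is a minimal Python
sketch under stated assumptions, not CarbonData code: the function names
`check_create_properties` and `apply_alter`, the dict-based properties and
the message texts are all hypothetical and only illustrate the behaviour
suggested here (warn on an explicit empty sort_columns instead of failing,
and allow sort_scope but not sort_columns to be altered).

```python
import warnings
from typing import Dict

# Assumption for this sketch: sort_columns is the only property that an
# alter must reject; everything else, including sort_scope, passes through.
NOT_ALTERABLE = {"sort_columns"}


def check_create_properties(props: Dict[str, str]) -> Dict[str, str]:
    """Accept an explicit empty sort_columns, but flag it as deprecated."""
    if "sort_columns" in props and not props["sort_columns"].strip():
        # Existing scripts keep working; they only get a warning.
        warnings.warn(
            "explicit empty sort_columns is deprecated; omit the property",
            DeprecationWarning,
            stacklevel=2,
        )
    return dict(props)


def apply_alter(current: Dict[str, str], changes: Dict[str, str]) -> Dict[str, str]:
    """Return updated properties, refusing any change to sort_columns."""
    rejected = sorted(k for k in changes if k in NOT_ALTERABLE)
    if rejected:
        raise ValueError("cannot alter: " + ", ".join(rejected))
    updated = dict(current)
    updated.update(changes)
    return updated
```

With this sketch, `apply_alter({"sort_scope": "no_sort"}, {"sort_scope":
"local_sort"})` succeeds, while passing `{"sort_columns": "c1"}` as the
change raises a ValueError.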

Regards,
Ravindra.


xuchuanyin

Dec 17, 2018; 12:01pm

RE: [Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

I think no_sort is the default only when the user does not explicitly specify sort_columns, not in all scenarios, right?

+1 for keeping ‘sort_columns’ unchanged, because the fields in sort_columns use a different encoding strategy from the others.

@Ajantha, please draw a conclusion from these mails before you start work.

===
Sent from laptop