[Discussion] Make 'no_sort' as default sort_scope and keep sort_columns as 'empty' by default

Posted by Ajantha Bhat on Dec 11, 2018; 1:31pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Make-no-sort-as-default-sort-scope-and-keep-sort-columns-as-empty-by-default-tp70233.html

Hi all,
Currently in CarbonData, the default sort_scope is 'local_sort', and by
default all the dimension columns are selected as sort_columns.
This slows down data loading.
*To give users the best performance with the default values,*
we can change the default sort_scope to 'no_sort' and stop using all
dimensions as sort_columns by default.
Also, if the user specifies sort_columns but not sort_scope, we need to
implicitly treat sort_scope as 'local_sort'.
These default values also apply to carbonsession, spark file format
and SDK. (All will have the same behavior.)
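
As a rough illustration of the proposed defaulting rule, here is a minimal
sketch in Python. It is not the actual CarbonData code:
`resolve_sort_defaults` is a hypothetical helper, and keeping an explicitly
given sort_scope unchanged is an assumption on my side.

```python
from typing import Optional, Sequence, Tuple


def resolve_sort_defaults(
    sort_scope: Optional[str] = None,
    sort_columns: Optional[Sequence[str]] = None,
) -> Tuple[str, Tuple[str, ...]]:
    """Hypothetical helper: return the effective (sort_scope, sort_columns)."""
    # Proposed default: sort_columns is empty unless the user names columns.
    columns = tuple(sort_columns or ())
    if sort_scope is not None:
        # Assumption: an explicitly specified sort_scope is kept as given.
        scope = sort_scope.lower()
    elif columns:
        # sort_columns given without sort_scope: implicitly 'local_sort'.
        scope = "local_sort"
    else:
        # Nothing specified: the proposed new default is 'no_sort'.
        scope = "no_sort"
    return scope, columns


if __name__ == "__main__":
    print(resolve_sort_defaults())
    print(resolve_sort_defaults(sort_columns=["c1", "c2"]))
```

The same rule would apply whichever entry point is used (carbonsession,
spark file format or SDK), since all are proposed to behave the same way.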

With these changes, below are the performance results of TPCH queries on
500GB of data:



* Load time is improved by nearly 4 times.
* Total query time across all queries is improved. (50% of the queries are
  faster with no_sort; the other 50% are slightly degraded or the same.
  Overall, performance is better.)

Also, while making this change, I found a few major issues in the existing
code in the 'no_sort' and empty sort_columns flow. I have fixed those as well.
Below are the issues found:




* [CARBONDATA-3162] Range filters don't remove null values for no_sort
  direct dictionary dimension columns.
* [CARBONDATA-3163] If a table has a different time format, for no_sort
  columns the data goes as a bad record (null) for the second table when
  it is loaded after the first table.
* [CARBONDATA-3164] During no_sort, an exception that happens at the
  converter step does not reach the user. The same problem exists in the
  SDK and spark file format flows.

Also fixed multiple test case issues.

I have already opened a PR for fixing these issues.
https://github.com/apache/carbondata/pull/2966

Let me know if you have any suggestions about these changes.

Thanks,
Ajantha