Apache CarbonData Dev Mailing List archive

[DISCUSSION] Unify the sort column and sort scope in create table command


Erlu Chen

[DISCUSSION] Unify the sort column and sort scope in create table command

Hi dev,

I am working on the following requirement, and I hope the community can give me some feedback on the two options described at the end.

Background.

1 Requirement
Currently, users can specify sort columns in the table properties when
creating a table, and they can also specify the sort scope in the load
options when loading data.
To make this easier to use, it would be better to specify all sort-related
parameters in the create table command.
Once the sort scope is specified in the create table command, it will be
used for data loading even if users have also specified one in the load
options.

2 Detailed design

2.1 Task-01
Requirement: Create table supports specifying the sort scope.
Implementation: Make use of the table properties (Map<String, String>). The
sort scope is specified in the table properties as a key/value pair, and the
existing interface is then called to write this pair into the metastore.
Global Sort, Local Sort and No Sort will be supported. The sort scope can be
specified in the SQL command:
CREATE TABLE tableWithGlobalSort (
  shortField SHORT,
  intField INT,
  bigintField LONG,
  doubleField DOUBLE,
  stringField STRING,
  timestampField TIMESTAMP,
  decimalField DECIMAL(18,2),
  dateField DATE,
  charField CHAR(5)
)
STORED BY 'carbondata'
TBLPROPERTIES('SORT_COLUMNS'='stringField', 'SORT_SCOPE'='GLOBAL_SORT')
Tip: If the sort scope is Global Sort, users should specify
GLOBAL_SORT_PARTITIONS. If they do not, the number of map tasks is used.
GLOBAL_SORT_PARTITIONS must be an integer in the range
[1, Integer.MAX_VALUE], and it is only used when the sort scope is Global
Sort.

Global Sort: uses the orderBy operator in Spark; data is ordered at the
             segment level.
Local Sort:  ordered per node; a carbondata file is ordered if it is written
             by one task.
No Sort:     no sorting.

Tip: Keys and values are case-insensitive.
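
To make the rules above concrete, here is a minimal Python sketch of how the two table properties could be read and validated. It is not CarbonData code: the function name is hypothetical, and the value spellings LOCAL_SORT and NO_SORT are assumed by analogy with the GLOBAL_SORT value shown in the SQL example.

```python
# Hypothetical helper for illustration; not a CarbonData API.
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE

# Assumption: LOCAL_SORT and NO_SORT are spelled like GLOBAL_SORT.
VALID_SORT_SCOPES = {"GLOBAL_SORT", "LOCAL_SORT", "NO_SORT"}


def parse_sort_properties(tbl_properties):
    """Return (sort_scope, global_sort_partitions) from raw TBLPROPERTIES.

    Either element is None when it is not set or not applicable.
    """
    # Keys are matched case-insensitively, as the proposal states.
    props = {k.strip().upper(): v.strip() for k, v in tbl_properties.items()}

    scope = props.get("SORT_SCOPE")
    if scope is not None:
        scope = scope.upper()  # values are case-insensitive too
        if scope not in VALID_SORT_SCOPES:
            raise ValueError(f"invalid SORT_SCOPE: {scope}")

    partitions = None
    raw = props.get("GLOBAL_SORT_PARTITIONS")
    # GLOBAL_SORT_PARTITIONS is only used when the scope is global sort.
    if raw is not None and scope == "GLOBAL_SORT":
        try:
            partitions = int(raw)
        except ValueError:
            raise ValueError(
                f"GLOBAL_SORT_PARTITIONS must be an integer: {raw!r}"
            ) from None
        if not 1 <= partitions <= INT_MAX:
            raise ValueError(
                f"GLOBAL_SORT_PARTITIONS out of range [1, {INT_MAX}]: {partitions}"
            )
    return scope, partitions
```

When GLOBAL_SORT_PARTITIONS is absent, the sketch returns None for it and leaves the fallback to the number of map tasks to the caller.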

2.2 Task-02
Requirement:
Data loading supports local sort, no sort and global sort.
The sort scope specified in the load data command is ignored, and the one
specified in create table is used instead.
Currently, users can specify the sort scope and the global sort partitions
in the load options. After this change, the sort scope specified in the load
options is ignored and the sort scope is read from the table properties.

Current logic: the sort scope comes from the load options

Number | Prerequisite                                     | Sort scope
1      | isSortTable is true && Sort Scope is Global Sort | Global Sort (checked first)
2      | isSortTable is false                             | No Sort
3      | isSortTable is true                              | Local Sort

Tip: isSortTable is true when the table has sort columns or contains
dimensions (except complex types), such as string columns.
For example:
Create table xxx1 (col1 string, col2 int) stored by 'carbondata' -- sort table
Create table xx1 (col1 int, col2 int) stored by 'carbondata' -- not a sort table
Create table xx (col1 int, col2 string) stored by 'carbondata' tblproperties
('sort_columns'='col1') -- sort table
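
The current decision table and the isSortTable rule can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and treating STRING as the only non-complex dimension type is an assumption made for the example, since the full list of dimension types is not given here.

```python
from typing import List, Optional, Tuple

# Assumption for illustration only: STRING is the one dimension type modelled.
ASSUMED_DIMENSION_TYPES = {"string"}


def is_sort_table(columns: List[Tuple[str, str]], sort_columns: List[str]) -> bool:
    """True if the table has sort columns or at least one dimension column."""
    if sort_columns:
        return True
    return any(col_type.lower() in ASSUMED_DIMENSION_TYPES for _, col_type in columns)


def current_sort_scope(sort_table: bool, load_scope: Optional[str]) -> str:
    """Current logic: the scope is driven by the load options."""
    if sort_table and load_scope == "GLOBAL_SORT":  # row 1, checked first
        return "GLOBAL_SORT"
    if not sort_table:                              # row 2
        return "NO_SORT"
    return "LOCAL_SORT"                             # row 3
```

The three example tables above map to is_sort_table returning True, False and True respectively.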

New logic: the sort scope comes from create table

Number | Prerequisite                                     | Code branch
1      | isSortTable is true && Sort Scope is Global Sort | Global Sort (checked first)
2      | isSortTable is false || Sort Scope is No Sort    | No Sort
3      | isSortTable is true && Sort Scope is Local Sort  | Local Sort
4      | isSortTable is true, Sort Scope not specified    | Local Sort (keeps current logic)
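
As a rough sketch, the new decision table could be written as a single Python function. The function name and the scope strings are illustrative assumptions, not CarbonData identifiers; None stands for a table created without a sort scope.

```python
from typing import Optional


def new_sort_scope(sort_table: bool, table_scope: Optional[str]) -> str:
    """New logic: the scope is driven by the table properties."""
    if sort_table and table_scope == "GLOBAL_SORT":   # row 1, checked first
        return "GLOBAL_SORT"
    if not sort_table or table_scope == "NO_SORT":    # row 2
        return "NO_SORT"
    # Row 3 (explicit LOCAL_SORT) and row 4 (scope not specified) both end
    # up here, which keeps the current default for sort tables.
    return "LOCAL_SORT"
```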

3 Acceptance standard

Number | Acceptance standard
1      | Users can specify the sort scope (global sort, local sort, no sort)
       | in SQL when creating a carbon table.
2      | Data loading ignores the sort scope specified in the load options
       | and uses the one specified in the create table command. If a user
       | still specifies the sort scope in the load options, a warning is
       | given to inform the user that the sort scope specified in create
       | table will be used.

Here is my JIRA: https://issues.apache.org/jira/browse/CARBONDATA-1438

You can see my simple design above.

But I am undecided between two options for loading data when a sort scope is
specified in the load options.

Option 1: Same as the design above. Ignore the sort scope specified in the
load options, give a warning message, and use the sort scope specified in
the create table command. If the table was created without a sort scope, the
sort scope specified in the load options is still never used.

Option 2: The sort scope in the create table command has higher priority
than the sort scope specified in the load options. This means that if the
table was created without a sort scope, the sort scope specified in the load
options is used.

I currently prefer the first option: it can be compatible, and we can remove
the sort scope from the load options completely in the future. Any ideas
about these two options?
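
To make the difference between the two options easier to compare, here is a small Python sketch. The function names, the logger name and the warning text are hypothetical; None means the scope was not specified at that level.

```python
import logging
from typing import Optional

log = logging.getLogger("sort_scope_sketch")  # hypothetical logger name


def effective_scope_option1(table_scope: Optional[str],
                            load_scope: Optional[str]) -> Optional[str]:
    """Option 1: the load option is always ignored, with a warning."""
    if load_scope is not None:
        log.warning("SORT_SCOPE in load options is ignored; "
                    "the sort scope from create table is used instead")
    # None here means the table default applies (Local Sort for sort tables).
    return table_scope


def effective_scope_option2(table_scope: Optional[str],
                            load_scope: Optional[str]) -> Optional[str]:
    """Option 2: the table wins; the load option is only a fallback."""
    return table_scope if table_scope is not None else load_scope
```

The two functions only differ when the table has no sort scope and the load specifies one: Option 1 returns None, Option 2 returns the load's scope.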

Regards.
Chenerlu.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

xuchuanyin-2

Re: [DISCUSSION] Unify the sort column and sort scope in create table command

Both options aim to make the sort scope the same across all segments (loads).
Since carbondata supports a different sort scope in each segment (load), I think there should be a third option.

Option 3: The sort scope in the load data command has higher priority than the one specified in the create table command. This means the sort scope in the create table command is a default value and is only used if the user doesn't specify one when loading data.

Option 3 leaves it to the user to balance loading and querying performance. Users can use global sort as the default scope and switch to local sort when they encounter large amounts of data during peak periods. I am not sure whether this would be a complicated or advanced usage, though.
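
A minimal Python sketch of Option 3, under the assumption that None means "not specified in this load"; the function name and the sample loads are made up for illustration.

```python
from typing import List, Optional


def effective_scope_option3(table_scope: Optional[str],
                            load_scope: Optional[str]) -> Optional[str]:
    """Option 3: the load option wins; the table property is the default."""
    return load_scope if load_scope is not None else table_scope


# A table created with global sort as the default scope.
table_default = "GLOBAL_SORT"

# Three loads; the second one switches to local sort for a peak period.
loads: List[Optional[str]] = [None, "LOCAL_SORT", None]

# Each load produces one segment, so scopes can differ between segments.
segment_scopes = [effective_scope_option3(table_default, s) for s in loads]
```

Here segment_scopes ends up as GLOBAL_SORT, LOCAL_SORT, GLOBAL_SORT, which is the per-segment mix described above.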

Besides, update is performed as a select followed by a load. So, what sort scope will this load use?

On 08/31/2017 17:45, Erlu Chen wrote:

sraghunandan

Re: [DISCUSSION] Unify the sort column and sort scope in create table command

Sort scope is different. It should not be decided at create time alone.
Supporting it in create table should only provide a default value; the
actual decision should be made during load, since it depends on the system
load and the balance required between load and query performance.

On Thu, 31 Aug 2017 at 4:25 PM, xuchuanyin <[hidden email]> wrote:
