GitHub user chenliang613 opened a pull request:
https://github.com/apache/carbondata/pull/1534 [CARBONDATA-1770] Update documents and consolidate DDL,DML,Partition docs 1. Update documents: fix some erroneous descriptions. 2. Consolidate the Data management, DDL, DML, and Partition docs, so that each feature is described in only one place. Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily: - [X] Any interfaces changed? NA - [X] Any backward compatibility impacted? NA - [X] Document update required? YES - [X] Testing done NA - [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. YES You can merge this pull request into a Git repository by running: $ git pull https://github.com/chenliang613/carbondata update_docs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/carbondata/pull/1534.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1534 ---- commit a0333be14051166072fb9865dc0623ee1473c92e Author: chenliang613 <[hidden email]> Date: 2017-11-19T13:12:11Z [CARBONDATA-1770] Update documents and consolidate DDL,DML,Partition docs ---- --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1534 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1286/ --- |
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1534 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/1756/ --- |
Github user sgururajshetty commented on the issue:
https://github.com/apache/carbondata/pull/1534 @chenliang613 kindly find my comments. The following descriptions can be added so that users know what each feature does: a description of Minor and Major compaction, and a description of Partition and its types. --- |
Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/1534 @sgururajshetty ok --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1534 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1318/ --- |
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480015 --- Diff: docs/data-management-on-carbondata.md --- @@ -461,25 +461,46 @@ This tutorial is going to introduce all commands and data operations on CarbonDa ## COMPACTION -This command merges the specified number of segments into one segment, compaction help to improve query performance. -``` + Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index). --- End diff -- Compaction improves the query performance significantly. During data load, several CarbonData files are generated; this is because data is sorted only within each load (one segment and one B+ tree index per load). --- |
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480127 --- Diff: docs/data-management-on-carbondata.md --- @@ -461,25 +461,46 @@ This tutorial is going to introduce all commands and data operations on CarbonDa ## COMPACTION -This command merges the specified number of segments into one segment, compaction help to improve query performance. -``` + Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index). + This means that there will be one index for each load and as number of data load increases, the number of indices also increases. + Compaction feature combines several segments into one large segment by merge sorting the data from across the segments. + + There are two types of compaction Minor and Major compaction. --- End diff -- There are two types of compaction, Minor and Major compaction. --- |
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480183 --- Diff: docs/data-management-on-carbondata.md --- @@ -461,25 +461,46 @@ This tutorial is going to introduce all commands and data operations on CarbonDa ## COMPACTION -This command merges the specified number of segments into one segment, compaction help to improve query performance. -``` + Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index). + This means that there will be one index for each load and as number of data load increases, the number of indices also increases. + Compaction feature combines several segments into one large segment by merge sorting the data from across the segments. + + There are two types of compaction Minor and Major compaction. + + ``` ALTER TABLE [db_name.]table_name COMPACT 'MINOR/MAJOR' -``` + ``` - **Minor Compaction** + + In minor compaction the user can specify how many loads to be merged. --- End diff -- In Minor compaction, user can specify the number of loads to be merged. --- |
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480386 --- Diff: docs/data-management-on-carbondata.md --- @@ -461,25 +461,46 @@ This tutorial is going to introduce all commands and data operations on CarbonDa ## COMPACTION -This command merges the specified number of segments into one segment, compaction help to improve query performance. -``` + Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index). + This means that there will be one index for each load and as number of data load increases, the number of indices also increases. + Compaction feature combines several segments into one large segment by merge sorting the data from across the segments. + + There are two types of compaction Minor and Major compaction. + + ``` ALTER TABLE [db_name.]table_name COMPACT 'MINOR/MAJOR' -``` + ``` - **Minor Compaction** + + In minor compaction the user can specify how many loads to be merged. + Minor compaction triggers for every data load if the parameter carbon.enable.auto.load.merge is set to true. + If any segments are available to be merged, then compaction will run parallel with data load, there are 2 levels in minor compaction: + * Level 1: Merging of the segments which are not yet compacted. + * Level 2: Merging of the compacted segments again to form a bigger segment. --- End diff -- Level 2: Merging of the compacted segments again to form a larger segment. --- |
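For context, the auto-merge behavior described in this hunk is driven by carbon.properties. A minimal sketch follows; only `carbon.enable.auto.load.merge` is named in the quoted doc itself, and the level-threshold key and its values are illustrative assumptions:

```
# carbon.properties (illustrative values)
# Trigger minor compaction automatically after each data load
carbon.enable.auto.load.merge=true
# Assumed level thresholds: merge 4 unmerged segments at level 1,
# then 3 already-compacted segments at level 2
carbon.compaction.level.threshold=4,3
```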
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480438 --- Diff: docs/data-management-on-carbondata.md --- @@ -461,25 +461,46 @@ This tutorial is going to introduce all commands and data operations on CarbonDa ## COMPACTION -This command merges the specified number of segments into one segment, compaction help to improve query performance. -``` + Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index). + This means that there will be one index for each load and as number of data load increases, the number of indices also increases. + Compaction feature combines several segments into one large segment by merge sorting the data from across the segments. + + There are two types of compaction Minor and Major compaction. + + ``` ALTER TABLE [db_name.]table_name COMPACT 'MINOR/MAJOR' -``` + ``` - **Minor Compaction** + + In minor compaction the user can specify how many loads to be merged. + Minor compaction triggers for every data load if the parameter carbon.enable.auto.load.merge is set to true. + If any segments are available to be merged, then compaction will run parallel with data load, there are 2 levels in minor compaction: + * Level 1: Merging of the segments which are not yet compacted. + * Level 2: Merging of the compacted segments again to form a bigger segment. + ``` ALTER TABLE table_name COMPACT 'MINOR' ``` - **Major Compaction** + + In Major compaction, many segments can be merged into one big segment. --- End diff -- In Major compaction, multiple segments can be merged into one large segment. --- |
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480541 --- Diff: docs/data-management-on-carbondata.md --- @@ -461,25 +461,46 @@ This tutorial is going to introduce all commands and data operations on CarbonDa ## COMPACTION -This command merges the specified number of segments into one segment, compaction help to improve query performance. -``` + Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index). + This means that there will be one index for each load and as number of data load increases, the number of indices also increases. + Compaction feature combines several segments into one large segment by merge sorting the data from across the segments. + + There are two types of compaction Minor and Major compaction. + + ``` ALTER TABLE [db_name.]table_name COMPACT 'MINOR/MAJOR' -``` + ``` - **Minor Compaction** + + In minor compaction the user can specify how many loads to be merged. + Minor compaction triggers for every data load if the parameter carbon.enable.auto.load.merge is set to true. + If any segments are available to be merged, then compaction will run parallel with data load, there are 2 levels in minor compaction: + * Level 1: Merging of the segments which are not yet compacted. + * Level 2: Merging of the compacted segments again to form a bigger segment. + ``` ALTER TABLE table_name COMPACT 'MINOR' ``` - **Major Compaction** + + In Major compaction, many segments can be merged into one big segment. + User will specify the compaction size until which segments can be merged, Major compaction is usually done during the off-peak time. 
+ This command merges the specified number of segments into one segment: + ``` ALTER TABLE table_name COMPACT 'MAJOR' ``` ## PARTITION + Similar other system's partition features, CarbonData's partition feature can be used to improve query performance by filtering on the partition column. --- End diff -- Similar to other systems' partition features, CarbonData's partition feature can also be used to improve query performance by filtering on the partition column. --- |
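Taken together, the compaction DDL being documented in these hunks can be sketched as follows (the table name is hypothetical):

```sql
-- Minor compaction: merges a configured number of recent loads;
-- runs in parallel with data load when auto merge is enabled
ALTER TABLE sales COMPACT 'MINOR';

-- Major compaction: merges multiple segments into one large segment;
-- usually scheduled during off-peak time
ALTER TABLE sales COMPACT 'MAJOR';
```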
Github user sgururajshetty commented on the issue:
https://github.com/apache/carbondata/pull/1534 LGTM --- |
Github user vandana7 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152548815 --- Diff: docs/data-management-on-carbondata.md --- @@ -0,0 +1,713 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +# Data Management on CarbonData + +This tutorial is going to introduce all commands and data operations on CarbonData. + +* [CREATE TABLE](#create-table) +* [TABLE MANAGEMENT](#table-management) +* [LOAD DATA](#load-data) +* [UPDATE AND DELETE](#update-and-delete) +* [COMPACTION](#compaction) +* [PARTITION](#partition) +* [BUCKETING](#bucketing) +* [SEGMENT MANAGEMENT](#segment-management) + +## CREATE TABLE + + This command can be used to create a CarbonData table by specifying the list of fields along with the table properties. + + ``` + CREATE TABLE [IF NOT EXISTS] [db_name.]table_name[(col_name data_type , ...)] + STORED BY 'carbondata' + [TBLPROPERTIES (property_name=property_value, ...)] + ``` + +### Usage Guidelines + + Following are the guidelines for TBLPROPERTIES, CarbonData's additional table options can be set via carbon.properties. + + - **Dictionary Encoding Configuration** + + Dictionary encoding is turned off for all columns by default from 1.3 onwards, you can use this command for including columns to do dictionary encoding. 
+ Suggested use cases : do dictionary encoding for low cardinality columns, it might help to improve data compression ratio and performance. + + ``` + TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2') + ``` + + - **Inverted Index Configuration** + + By default inverted index is enabled, it might help to improve compression ratio and query speed, especially for low cardinality columns which are in reward position. + Suggested use cases : For high cardinality columns, you can disable the inverted index for improving the data loading performance. + + ``` + TBLPROPERTIES ('NO_INVERTED_INDEX'='column1, column3') + ``` + + - **Sort Columns Configuration** + + This property is for users to specify which columns belong to the MDK(Multi-Dimensions-Key) index. + * If users don't specify "SORT_COLUMN" property, by default MDK index be built by using all dimension columns except complex datatype column. + * If this property is specified but with empty argument, then the table will be loaded without sort.. + Suggested use cases : Only build MDK index for required columns,it might help to improve the data loading performance. + + ``` + TBLPROPERTIES ('SORT_COLUMNS'='column1, column3') + OR + TBLPROPERTIES ('SORT_COLUMNS'='') + ``` + + - **Sort Scope Configuration** + + This property is for users to specify the scope of the sort during data load, following are the types of sort scope. + + * LOCAL_SORT: It is the default sort scope. + * NO_SORT: It will load the data in unsorted manner, it will significantly increase load performance. + * BATCH_SORT: It increases the load performance but decreases the query performance if identified blocks > parallelism. + * GLOBAL_SORT: It increases the query performance, especially high concurrent point query. + And if you care about loading resources isolation strictly, because the system uses the spark GroupBy to sort data, the resource can be controlled by spark. 
+ + - **Table Block Size Configuration** + + This command is for setting block size of this table, the default value is 1024 MB and supports a range of 1 MB to 2048 MB. + + ``` + TBLPROPERTIES ('TABLE_BLOCKSIZE'='512') + //512 or 512M both are accepted. --- End diff -- add a Note tag before writing 512 or 512M both are accepted. as "//" are used in the code for making notes or comments --- |
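A sketch of how the suggested fix could look in the doc, moving the comment out of the code fence as the reviewer asks (wording taken from the diff):

```sql
TBLPROPERTIES ('TABLE_BLOCKSIZE'='512')
```

NOTE: 512 or 512M both are accepted.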
Github user vandana7 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152549583 --- Diff: docs/data-management-on-carbondata.md --- @@ -0,0 +1,713 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +# Data Management on CarbonData + +This tutorial is going to introduce all commands and data operations on CarbonData. + +* [CREATE TABLE](#create-table) +* [TABLE MANAGEMENT](#table-management) +* [LOAD DATA](#load-data) +* [UPDATE AND DELETE](#update-and-delete) +* [COMPACTION](#compaction) +* [PARTITION](#partition) +* [BUCKETING](#bucketing) +* [SEGMENT MANAGEMENT](#segment-management) + +## CREATE TABLE + + This command can be used to create a CarbonData table by specifying the list of fields along with the table properties. + + ``` + CREATE TABLE [IF NOT EXISTS] [db_name.]table_name[(col_name data_type , ...)] + STORED BY 'carbondata' + [TBLPROPERTIES (property_name=property_value, ...)] + ``` + +### Usage Guidelines + + Following are the guidelines for TBLPROPERTIES, CarbonData's additional table options can be set via carbon.properties. + + - **Dictionary Encoding Configuration** + + Dictionary encoding is turned off for all columns by default from 1.3 onwards, you can use this command for including columns to do dictionary encoding. 
+ Suggested use cases : do dictionary encoding for low cardinality columns, it might help to improve data compression ratio and performance. + + ``` + TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2') + ``` + + - **Inverted Index Configuration** + + By default inverted index is enabled, it might help to improve compression ratio and query speed, especially for low cardinality columns which are in reward position. + Suggested use cases : For high cardinality columns, you can disable the inverted index for improving the data loading performance. + + ``` + TBLPROPERTIES ('NO_INVERTED_INDEX'='column1, column3') + ``` + + - **Sort Columns Configuration** + + This property is for users to specify which columns belong to the MDK(Multi-Dimensions-Key) index. + * If users don't specify "SORT_COLUMN" property, by default MDK index be built by using all dimension columns except complex datatype column. + * If this property is specified but with empty argument, then the table will be loaded without sort.. + Suggested use cases : Only build MDK index for required columns,it might help to improve the data loading performance. + + ``` + TBLPROPERTIES ('SORT_COLUMNS'='column1, column3') + OR + TBLPROPERTIES ('SORT_COLUMNS'='') + ``` + + - **Sort Scope Configuration** + + This property is for users to specify the scope of the sort during data load, following are the types of sort scope. + + * LOCAL_SORT: It is the default sort scope. + * NO_SORT: It will load the data in unsorted manner, it will significantly increase load performance. + * BATCH_SORT: It increases the load performance but decreases the query performance if identified blocks > parallelism. + * GLOBAL_SORT: It increases the query performance, especially high concurrent point query. + And if you care about loading resources isolation strictly, because the system uses the spark GroupBy to sort data, the resource can be controlled by spark. 
+ + - **Table Block Size Configuration** + + This command is for setting block size of this table, the default value is 1024 MB and supports a range of 1 MB to 2048 MB. + + ``` + TBLPROPERTIES ('TABLE_BLOCKSIZE'='512') + //512 or 512M both are accepted. + ``` + +### Example: + ``` + CREATE TABLE IF NOT EXISTS productSchema.productSalesTable ( + productNumber Int, + productName String, + storeCity String, + storeProvince String, + productCategory String, + productBatch String, + saleQuantity Int, + revenue Int) + STORED BY 'carbondata' + TBLPROPERTIES ('DICTIONARY_INCLUDE'='productNumber', + 'NO_INVERTED_INDEX'='productBatch', + 'SORT_COLUMNS'='productName,storeCity', + 'SORT_SCOPE'='NO_SORT', + 'TABLE_BLOCKSIZE'='512') + ``` + +## TABLE MANAGEMENT + +### SHOW TABLE + + This command can be used to list all the tables in current database or all the tables of a specific database. + ``` + SHOW TABLES [IN db_Name] + ``` + + Example: + ``` + SHOT TABLES --- End diff -- SHOW TABLES --- |
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1534 SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/1824/ --- |
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/1534 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1372/ --- |
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152589597 --- Diff: docs/data-management-on-carbondata.md --- @@ -0,0 +1,713 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +# Data Management on CarbonData + +This tutorial is going to introduce all commands and data operations on CarbonData. + +* [CREATE TABLE](#create-table) +* [TABLE MANAGEMENT](#table-management) +* [LOAD DATA](#load-data) +* [UPDATE AND DELETE](#update-and-delete) +* [COMPACTION](#compaction) +* [PARTITION](#partition) +* [BUCKETING](#bucketing) +* [SEGMENT MANAGEMENT](#segment-management) + +## CREATE TABLE + + This command can be used to create a CarbonData table by specifying the list of fields along with the table properties. + + ``` + CREATE TABLE [IF NOT EXISTS] [db_name.]table_name[(col_name data_type , ...)] + STORED BY 'carbondata' + [TBLPROPERTIES (property_name=property_value, ...)] + ``` + +### Usage Guidelines + + Following are the guidelines for TBLPROPERTIES, CarbonData's additional table options can be set via carbon.properties. + + - **Dictionary Encoding Configuration** + + Dictionary encoding is turned off for all columns by default from 1.3 onwards, you can use this command for including columns to do dictionary encoding. 
+ Suggested use cases : do dictionary encoding for low cardinality columns, it might help to improve data compression ratio and performance. + + ``` + TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2') + ``` + + - **Inverted Index Configuration** + + By default inverted index is enabled, it might help to improve compression ratio and query speed, especially for low cardinality columns which are in reward position. + Suggested use cases : For high cardinality columns, you can disable the inverted index for improving the data loading performance. + + ``` + TBLPROPERTIES ('NO_INVERTED_INDEX'='column1, column3') + ``` + + - **Sort Columns Configuration** + + This property is for users to specify which columns belong to the MDK(Multi-Dimensions-Key) index. + * If users don't specify "SORT_COLUMN" property, by default MDK index be built by using all dimension columns except complex datatype column. + * If this property is specified but with empty argument, then the table will be loaded without sort.. + Suggested use cases : Only build MDK index for required columns,it might help to improve the data loading performance. + + ``` + TBLPROPERTIES ('SORT_COLUMNS'='column1, column3') + OR + TBLPROPERTIES ('SORT_COLUMNS'='') + ``` + + - **Sort Scope Configuration** + + This property is for users to specify the scope of the sort during data load, following are the types of sort scope. + + * LOCAL_SORT: It is the default sort scope. + * NO_SORT: It will load the data in unsorted manner, it will significantly increase load performance. + * BATCH_SORT: It increases the load performance but decreases the query performance if identified blocks > parallelism. + * GLOBAL_SORT: It increases the query performance, especially high concurrent point query. + And if you care about loading resources isolation strictly, because the system uses the spark GroupBy to sort data, the resource can be controlled by spark. 
+ + - **Table Block Size Configuration** + + This command is for setting block size of this table, the default value is 1024 MB and supports a range of 1 MB to 2048 MB. + + ``` + TBLPROPERTIES ('TABLE_BLOCKSIZE'='512') + //512 or 512M both are accepted. + ``` + +### Example: + ``` + CREATE TABLE IF NOT EXISTS productSchema.productSalesTable ( + productNumber Int, + productName String, + storeCity String, + storeProvince String, + productCategory String, + productBatch String, + saleQuantity Int, + revenue Int) + STORED BY 'carbondata' + TBLPROPERTIES ('DICTIONARY_INCLUDE'='productNumber', + 'NO_INVERTED_INDEX'='productBatch', + 'SORT_COLUMNS'='productName,storeCity', + 'SORT_SCOPE'='NO_SORT', + 'TABLE_BLOCKSIZE'='512') + ``` + +## TABLE MANAGEMENT + +### SHOW TABLE + + This command can be used to list all the tables in current database or all the tables of a specific database. + ``` + SHOW TABLES [IN db_Name] + ``` + + Example: + ``` + SHOT TABLES --- End diff -- fixed --- |
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152589826 --- Diff: docs/data-management-on-carbondata.md --- @@ -0,0 +1,713 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +# Data Management on CarbonData + +This tutorial is going to introduce all commands and data operations on CarbonData. + +* [CREATE TABLE](#create-table) +* [TABLE MANAGEMENT](#table-management) +* [LOAD DATA](#load-data) +* [UPDATE AND DELETE](#update-and-delete) +* [COMPACTION](#compaction) +* [PARTITION](#partition) +* [BUCKETING](#bucketing) +* [SEGMENT MANAGEMENT](#segment-management) + +## CREATE TABLE + + This command can be used to create a CarbonData table by specifying the list of fields along with the table properties. + + ``` + CREATE TABLE [IF NOT EXISTS] [db_name.]table_name[(col_name data_type , ...)] + STORED BY 'carbondata' + [TBLPROPERTIES (property_name=property_value, ...)] + ``` + +### Usage Guidelines + + Following are the guidelines for TBLPROPERTIES, CarbonData's additional table options can be set via carbon.properties. + + - **Dictionary Encoding Configuration** + + Dictionary encoding is turned off for all columns by default from 1.3 onwards, you can use this command for including columns to do dictionary encoding. 
+ Suggested use cases : do dictionary encoding for low cardinality columns, it might help to improve data compression ratio and performance. + + ``` + TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2') + ``` + + - **Inverted Index Configuration** + + By default inverted index is enabled, it might help to improve compression ratio and query speed, especially for low cardinality columns which are in reward position. + Suggested use cases : For high cardinality columns, you can disable the inverted index for improving the data loading performance. + + ``` + TBLPROPERTIES ('NO_INVERTED_INDEX'='column1, column3') + ``` + + - **Sort Columns Configuration** + + This property is for users to specify which columns belong to the MDK(Multi-Dimensions-Key) index. + * If users don't specify "SORT_COLUMN" property, by default MDK index be built by using all dimension columns except complex datatype column. + * If this property is specified but with empty argument, then the table will be loaded without sort.. + Suggested use cases : Only build MDK index for required columns,it might help to improve the data loading performance. + + ``` + TBLPROPERTIES ('SORT_COLUMNS'='column1, column3') + OR + TBLPROPERTIES ('SORT_COLUMNS'='') + ``` + + - **Sort Scope Configuration** + + This property is for users to specify the scope of the sort during data load, following are the types of sort scope. + + * LOCAL_SORT: It is the default sort scope. + * NO_SORT: It will load the data in unsorted manner, it will significantly increase load performance. + * BATCH_SORT: It increases the load performance but decreases the query performance if identified blocks > parallelism. + * GLOBAL_SORT: It increases the query performance, especially high concurrent point query. + And if you care about loading resources isolation strictly, because the system uses the spark GroupBy to sort data, the resource can be controlled by spark. 
+ + - **Table Block Size Configuration** + + This command is for setting block size of this table, the default value is 1024 MB and supports a range of 1 MB to 2048 MB. + + ``` + TBLPROPERTIES ('TABLE_BLOCKSIZE'='512') + //512 or 512M both are accepted. --- End diff -- accept, fixed. --- |
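The table properties discussed across these hunks can be combined into one small sketch (table and column names are hypothetical; property names and values come from the quoted diff):

```sql
CREATE TABLE IF NOT EXISTS sales_by_city (
  city String,
  amount Int)
STORED BY 'carbondata'
TBLPROPERTIES ('SORT_COLUMNS'='city',
               'SORT_SCOPE'='LOCAL_SORT',
               'TABLE_BLOCKSIZE'='512')
```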
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/1534 LGTM --- |