[GitHub] incubator-carbondata pull request #635: [WIP]support SORT_COLUMNS

GitHub user QiangCai opened a pull request:

https://github.com/apache/incubator-carbondata/pull/635

[WIP]support SORT_COLUMNS

1. create table with sort_columns
e.g. tblproperties('sort_columns' = 'col7,col3')
The columns listed in sort_columns will be placed at the beginning of all columns.
sort_columns supports all primitive data types.

2. loading
sort by sort_columns
sort columns: sorted, with a rowid index
other dimensions: not sorted, no rowid index

3. compaction
sort by sort_columns

4. filter on sort_columns
filtering on no-dictionary columns supports all primitive data types
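
The column reordering described in point 1 can be pictured with a minimal sketch. This is only an illustration under stated assumptions, not the code in this PR: the class `SortColumnsOrder` and the method `reorder` are hypothetical names, and columns are modelled as plain strings rather than CarbonData schema objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of CarbonData.
class SortColumnsOrder {

  // Returns the schema columns with the sort columns first (in the order
  // given by the sort_columns property), followed by the remaining columns
  // in their original order.
  static List<String> reorder(List<String> schemaColumns, String sortColumnsProperty) {
    List<String> ordered = new ArrayList<>();
    for (String raw : sortColumnsProperty.split(",")) {
      String name = raw.trim();
      if (!schemaColumns.contains(name)) {
        throw new IllegalArgumentException("sort_columns names an unknown column: " + name);
      }
      if (!ordered.contains(name)) {
        ordered.add(name);
      }
    }
    for (String column : schemaColumns) {
      if (!ordered.contains(column)) {
        ordered.add(column);
      }
    }
    return ordered;
  }

  public static void main(String[] args) {
    List<String> schema =
        Arrays.asList("col1", "col2", "col3", "col4", "col5", "col6", "col7");
    // prints [col7, col3, col1, col2, col4, col5, col6]
    System.out.println(reorder(schema, "col7,col3"));
  }
}
```

With 'sort_columns' = 'col7,col3' as in the example above, col7 and col3 move to the front and the other columns keep their relative order.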

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/QiangCai/incubator-carbondata sortkey

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/635.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #635

----
commit ad4a3d59c3c0f5cae45f1bd0333718c7c8ada62e
Author: QiangCai <[hidden email]>
Date: 2017-03-02T09:48:54Z

sort columns

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1041/

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user ravipesala commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

@QiangCai I have a few doubts.
Why are primitive data types supported as no-dictionary columns in this PR? They are supposed to be direct dictionary.
Why are date and timestamp supported as no-dictionary? They already have direct dictionary support, which is much more efficient in terms of loading and query.

I think the scope of this PR should be limited to the following points.
1. Support Sort_columns in DDL and metadata.
2. In the old flow, all columns with dictionary_include and dictionary_exclude already become sort_columns and the remaining columns are measures. Now there would no longer be any measure concept, so sort_columns should be sorted and have a rowid index, while the remaining columns should not be sorted or have a rowid index but should have value/delta compression if they are of a numeric data type.

I feel it would have been better to have some discussion on the mailing list before starting the implementation, to keep people in sync with you and avoid unnecessary rework.

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1047/

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user QiangCai commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

@ravipesala Good suggestion. Direct dictionary is better than no dictionary. I will add it.

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user ravipesala commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

@QiangCai Please mention which tasks you are doing in this PR. It is better to stick to only supporting sort_columns in this PR. Other tasks can be pushed to other PRs.

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user QiangCai commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

@ravipesala I have listed the tasks.
It would be better to implement another direct-dictionary encoding for numeric data type columns. We can remove the dimension and measure concepts and use only the column concept. The encoding of a column will be decided by the data type of the column and the table properties.

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user QiangCai commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

@ravipesala
Is it necessary to require that the sort_columns come from dimensions?
If the table needs to be sorted by a measure, we should use dictionary_include to add it to the dimension list.

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user kumarvishal09 commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

@QiangCai I have queries related to this PR.
1. If the user has not specified any sort column, will it go to the old flow (sorting based on all dimension columns), or will the data not be sorted?
2. If the data is not sorted, we cannot use a B+ tree; we need to use some other linear data structure such as an array or linked list. I have not seen any changes related to this.
3. The B-tree is created based on the sort columns, so with this PR we need to update the B-tree loading, as only sort columns will participate in creating the B-tree.
4. How are you creating the start key and end key, given that only sort columns can participate in both keys? The B-tree jump will not work if other columns (except sort columns) participate in the start and end keys.

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1078/

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1140/

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Failed with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1171/

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1172/

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1174/

[GitHub] incubator-carbondata issue #635: [WIP]support SORT_COLUMNS

Github user QiangCai commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

@kumarvishal09
1. If the user has not specified any sort column, it will go to the old flow, sorting based on all dimension columns.
2. Yes.
3. During data loading, the start/end key of the blocklet info contains only sort columns.
4. For data loading, just use the sort columns to build the start/end key of the blocklet info.
Code line: CarbonFactDataHandlerColumnar.java 1041
For select queries, just use the sort columns to build the start/end key of filters.
Code line: FilterUtil.java 1159 and 1206
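
As a rough illustration of points 3 and 4, the sketch below builds a start/end key from the sort-column values only and uses it to decide whether a blocklet can be skipped for a filter. It is a simplified model under stated assumptions, not the logic in CarbonFactDataHandlerColumnar or FilterUtil: the class `SortKeyRange` and its methods are hypothetical, keys are plain string arrays compared lexicographically, and the sort columns are assumed to be the leading columns of each row.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of CarbonData.
class SortKeyRange {

  // Key of a row: only the leading sort-column values take part.
  static String[] keyOf(String[] row, int numberOfSortColumns) {
    return Arrays.copyOfRange(row, 0, numberOfSortColumns);
  }

  // Compares two keys on their common prefix, so a filter that constrains
  // only the first sort column can still be compared with a full key.
  static int comparePrefix(String[] a, String[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int c = a[i].compareTo(b[i]);
      if (c != 0) {
        return c;
      }
    }
    return 0;
  }

  // A blocklet whose rows are sorted by the sort columns may contain the
  // filter key only if startKey <= filterKey <= endKey.
  static boolean mayContain(String[] startKey, String[] endKey, String[] filterKey) {
    return comparePrefix(startKey, filterKey) <= 0
        && comparePrefix(filterKey, endKey) <= 0;
  }

  public static void main(String[] args) {
    // Rows of one blocklet, already sorted by the two leading sort columns.
    String[][] rows = {
        {"P1_10", "P3_1", "other-a"},
        {"P1_15", "P3_4", "other-b"},
        {"P1_20", "P3_9", "other-c"}
    };
    String[] startKey = keyOf(rows[0], 2);
    String[] endKey = keyOf(rows[rows.length - 1], 2);

    System.out.println(mayContain(startKey, endKey, new String[] {"P1_15"})); // prints true
    System.out.println(mayContain(startKey, endKey, new String[] {"P1_30"})); // prints false
  }
}
```

Because columns outside the sort columns are not ordered within a blocklet, including them in the keys would give ranges that say nothing useful about where a value can appear, which is the concern raised in question 4.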

@ravipesala
I have removed the date & timestamp data types from no-dictionary.
It would be better to raise another PR to implement the new numeric data type encoding.

[GitHub] incubator-carbondata issue #635: [CARBONDATA-782]support SORT_COLUMNS

Github user QiangCai commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Data Records : 1 * 1000 * 1000
```
SORT BY ALL DIMENSION: c1,c2,c3,c4,c5

CREATE TABLE IF NOT EXISTS default.carbon_perftest_table
(c1 STRING, c2 STRING, c3 STRING, c4 STRING, c5 STRING, c6 INT, c7 INT, c8 INT, c9 INT, c10 INT)
STORED BY 'org.apache.carbondata.format'

LOAD DATA INPATH '/home/david/Documents/incubator-carbondata/examples/spark/target/store/tempCSV_default_carbon_perftest_table_4113107653051'
INTO TABLE default.carbon_perftest_table
OPTIONS ('FILEHEADER' = 'c1,c2,c3,c4,c5,c6,c7,c8,c9,c10', 'USE_KETTLE' = 'false')

load performance: 1782, 3217, 10171
OLAP Query 0: 736, 1690, 664 [sql: SELECT c3, c4, sum(c8) FROM tableName WHERE c1 = 'P1_23' and c2 = 'P2_43' GROUP BY c3, c4]
OLAP Query 1: 527, 1874, 542 [sql: SELECT c2, c3, sum(c9) FROM tableName WHERE c1 = 'P1_432' and c4 = 'P4_3' and c5 = 'P5_2' GROUP by c2, c3 ]
OLAP Query 2: 3088, 3973, 2996 [sql: SELECT c2, count(distinct c1), sum(c8) FROM tableName WHERE c3="P3_4" and c5="P5_4" GROUP BY c2 ]
OLAP Query 3: 2493, 3710, 2622 [sql: SELECT c2, c5, count(distinct c1), sum(c7) FROM tableName WHERE c4="P4_4" and c5="P5_7" and c8>4 GROUP BY c2, c5 ]
Point Query 0: 114, 516, 98 [sql: SELECT c4 FROM tableName WHERE c1="P1_43" ]
Point Query 1: 126, 664, 99 [sql: SELECT c3 FROM tableName WHERE c1="P1_542" and c2="P2_23" ]
Point Query 2: 128, 817, 165 [sql: SELECT c3, c5 FROM tableName WHERE c1="P1_52" and c7=4]
Point Query 3: 113, 530, 155 [sql: SELECT c4, c9 FROM tableName WHERE c1="P1_43" and c8<3]
Filter Query 0: 209, 1319, 154 [sql: SELECT * FROM tableName WHERE c2="P2_43" ]
Filter Query 1: 283, 1686, 289 [sql: SELECT * FROM tableName WHERE c3="P3_3" ]
Filter Query 2: 319, 1306, 137 [sql: SELECT * FROM tableName WHERE c2="P2_32" and c3="P3_23" ]
Filter Query 3: 234, 1242, 154 [sql: SELECT * FROM tableName WHERE c3="P3_28" and c4="P4_3" ]
Scan Query 0: 162, 318, 327 [sql: SELECT sum(c7), sum(c8), avg(c9), max(c10) FROM tableName ]
Scan Query 1: 107, 406, 97 [sql: SELECT sum(c7) FROM tableName WHERE c2="P2_32" ]
Scan Query 2: 157, 546, 141 [sql: SELECT sum(c7), sum(c8), sum(9), sum(c10) FROM tableName WHERE c4="P4_4" ]
Scan Query 3: 121, 480, 170 [sql: SELECT sum(c7), sum(c8), sum(9), sum(c10) FROM tableName WHERE c2="P2_75" and c6<5 ]
Total time: 8924.109771, 21083.679769, 8817.163649

SORT_COLUMNS: c1,c3

CREATE TABLE IF NOT EXISTS default.carbon_perftest_table
(c1 STRING, c2 STRING, c3 STRING, c4 STRING, c5 STRING, c6 INT, c7 INT, c8 INT, c9 INT, c10 INT)
STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES('SORT_COLUMNS'='c1,c3')

LOAD DATA INPATH '/home/david/Documents/incubator-carbondata/examples/spark/target/store/tempCSV_default_carbon_perftest_table_4448597034063'
INTO TABLE default.carbon_perftest_table
OPTIONS ('FILEHEADER' = 'c1,c2,c3,c4,c5,c6,c7,c8,c9,c10', 'USE_KETTLE' = 'false')

load performance: 1649, 3108, 9070
OLAP Query 0: 651, 1567, 615 [sql: SELECT c3, c4, sum(c8) FROM tableName WHERE c1 = 'P1_23' and c2 = 'P2_43' GROUP BY c3, c4]
OLAP Query 1: 502, 1792, 448 [sql: SELECT c2, c3, sum(c9) FROM tableName WHERE c1 = 'P1_432' and c4 = 'P4_3' and c5 = 'P5_2' GROUP by c2, c3 ]
OLAP Query 2: 3028, 3741, 2600 [sql: SELECT c2, count(distinct c1), sum(c8) FROM tableName WHERE c3="P3_4" and c5="P5_4" GROUP BY c2 ]
OLAP Query 3: 2535, 3777, 2704 [sql: SELECT c2, c5, count(distinct c1), sum(c7) FROM tableName WHERE c4="P4_4" and c5="P5_7" and c8>4 GROUP BY c2, c5 ]
Point Query 0: 107, 566, 82 [sql: SELECT c4 FROM tableName WHERE c1="P1_43" ]
Point Query 1: 158, 681, 96 [sql: SELECT c3 FROM tableName WHERE c1="P1_542" and c2="P2_23" ]
Point Query 2: 151, 747, 149 [sql: SELECT c3, c5 FROM tableName WHERE c1="P1_52" and c7=4]
Point Query 3: 128, 530, 141 [sql: SELECT c4, c9 FROM tableName WHERE c1="P1_43" and c8<3]
Filter Query 0: 212, 1292, 124 [sql: SELECT * FROM tableName WHERE c2="P2_43" ]
Filter Query 1: 214, 1271, 329 [sql: SELECT * FROM tableName WHERE c3="P3_3" ]
Filter Query 2: 203, 1216, 102 [sql: SELECT * FROM tableName WHERE c2="P2_32" and c3="P3_23" ]
Filter Query 3: 274, 1256, 108 [sql: SELECT * FROM tableName WHERE c3="P3_28" and c4="P4_3" ]
Scan Query 0: 152, 345, 306 [sql: SELECT sum(c7), sum(c8), avg(c9), max(c10) FROM tableName ]
Scan Query 1: 133, 344, 86 [sql: SELECT sum(c7) FROM tableName WHERE c2="P2_32" ]
Scan Query 2: 122, 485, 126 [sql: SELECT sum(c7), sum(c8), sum(9), sum(c10) FROM tableName WHERE c4="P4_4" ]
Scan Query 3: 141, 451, 168 [sql: SELECT sum(c7), sum(c8), sum(9), sum(c10) FROM tableName WHERE c2="P2_75" and c6<5 ]
Total time: 8718.807424, 20070.015716, 8191.425242
```
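
To compare the two runs, the relative change between the "Total time" lines can be computed as below. This is a small sketch, not part of the benchmark: the class `TotalTimeChange` is a hypothetical name, and the thread does not state what the three values on each line represent, so they are simply compared position by position.

```java
// Hypothetical illustration; not part of the benchmark in this PR.
class TotalTimeChange {

  // Relative change of candidate against baseline, in percent.
  // Negative values mean the candidate took less time.
  static double percentChange(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // "Total time" values copied from the two runs above.
    double[] sortByAllDimensions = {8924.109771, 21083.679769, 8817.163649};
    double[] sortColumnsC1C3 = {8718.807424, 20070.015716, 8191.425242};

    for (int i = 0; i < sortByAllDimensions.length; i++) {
      double change = percentChange(sortByAllDimensions[i], sortColumnsC1C3[i]);
      System.out.println(String.format("value %d: %.2f%%", i + 1, change));
    }
  }
}
```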

[GitHub] incubator-carbondata issue #635: [CARBONDATA-782]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1185/

[GitHub] incubator-carbondata issue #635: [CARBONDATA-782]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1202/

[GitHub] incubator-carbondata issue #635: [CARBONDATA-782]support SORT_COLUMNS

Github user CarbonDataQA commented on the issue:

https://github.com/apache/incubator-carbondata/pull/635

Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1212/

[GitHub] incubator-carbondata pull request #635: [CARBONDATA-782]support SORT_COLUMNS

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/635#discussion_r106802856

--- Diff: core/src/main/java/org/apache/carbondata/core/metadata/schema/table/CarbonTable.java ---
@@ -153,6 +164,21 @@ public void loadCarbonTable(TableInfo tableInfo) {
         tableInfo.getFactTable().getBucketingInfo());
   }

+  private void parseSortColumns(TableSchema tableSchema) {
+    Map<String, String> tableProperties = tableSchema.getTableProperties();
+    if (tableProperties != null) {
+      String sortColumnsString = tableProperties.get(CarbonCommonConstants.SORT_COLUMNS);
+      if (sortColumnsString != null) {
+        numberOfSortColumns = sortColumnsString.split(",").length;
+        for (int i = 0; i < numberOfSortColumns; i++) {
+          if (!tableSchema.getListOfColumns().get(i).hasEncoding(Encoding.DICTIONARY)) {
+            numberOfNoDictSortColumns++;
--- End diff --

Are you sure that the sort columns and the tableSchema columns are in the same order?
I think it is better to compare the column names for equality than to rely on that assumption.
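
One way to follow this suggestion is to look each sort column up by name instead of by position. The sketch below is only an illustration under stated assumptions, not the fix applied in the PR: `SortColumnsByName` and its nested `Column` class are hypothetical stand-ins for the real schema types, and matching names case-insensitively is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of CarbonData.
class SortColumnsByName {

  // Stand-in for a schema column: a name plus whether it is dictionary encoded.
  static final class Column {
    final String name;
    final boolean dictionaryEncoded;

    Column(String name, boolean dictionaryEncoded) {
      this.name = name;
      this.dictionaryEncoded = dictionaryEncoded;
    }
  }

  // Counts the sort columns that are not dictionary encoded, matching by
  // column name rather than assuming the sort columns come first in the schema.
  static int countNoDictSortColumns(List<Column> schemaColumns, String sortColumnsString) {
    Map<String, Column> byName = new HashMap<>();
    for (Column column : schemaColumns) {
      byName.put(column.name.toLowerCase(Locale.ROOT), column);
    }
    int noDictCount = 0;
    for (String raw : sortColumnsString.split(",")) {
      String name = raw.trim().toLowerCase(Locale.ROOT);
      Column column = byName.get(name);
      if (column == null) {
        throw new IllegalArgumentException("Unknown sort column: " + raw.trim());
      }
      if (!column.dictionaryEncoded) {
        noDictCount++;
      }
    }
    return noDictCount;
  }

  public static void main(String[] args) {
    // The schema order deliberately differs from the sort_columns order.
    List<Column> schema = Arrays.asList(
        new Column("c2", false),
        new Column("c1", true),
        new Column("c3", true));
    // prints 0; a positional check of the first two schema columns would count c2
    System.out.println(countNoDictSortColumns(schema, "c1,c3"));
  }
}
```

The name lookup costs one extra map per table load and stays correct even if the schema columns are not stored with the sort columns first.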
