GitHub user chenerlu opened a pull request:
https://github.com/apache/carbondata/pull/1321

[CARBONDATA-1438] Unify the sort column and sort scope in create table command

Background: to improve ease of use, unify the sort column and sort scope in the create table command.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenerlu/incubator-carbondata pr-1438

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1321.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1321

----
commit c20caf973a70e6c6dc94d3cb26bf22404646e8eb
Author: chenerlu <[hidden email]>
Date: 2017-09-04T12:54:55Z

    Unify the sort column and sort scope in create table command
----
GitHub user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r136828909

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestGlobalSortDataLoad.scala ---
@@ -109,23 +118,22 @@ class TestGlobalSortDataLoad extends QueryTest with BeforeAndAfterEach with Befo
   }

   // ----------------------------------- Configuration Validity -----------------------------------
-  test("Don't support GLOBAL_SORT on partitioned table") {
+  ignore("Don't support GLOBAL_SORT on partitioned table") {
--- End diff --

Why is this testcase ignored?
---
GitHub user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r136829721

--- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala ---
@@ -639,6 +639,23 @@ case class LoadTable(
     val carbonProperty: CarbonProperties = CarbonProperties.getInstance()
     carbonProperty.addProperty("zookeeper.enable.lock", "false")
     val optionsFinal = getFinalOptions(carbonProperty)
+    val tableProperties = relation.tableMeta.carbonTable.getTableInfo
+      .getFactTable.getTableProperties
+
+    optionsFinal.put("sort_scope", tableProperties.getOrDefault("sort_scope",
+      carbonProperty.getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_SORT_SCOPE,
+        carbonProperty.getProperty(CarbonCommonConstants.LOAD_SORT_SCOPE,
+          CarbonCommonConstants.LOAD_SORT_SCOPE_DEFAULT))))
+
+    optionsFinal.put("batch_sort_size_inmb", tableProperties.getOrDefault("batch_sort_size_inmb",
--- End diff --

This is needed only if using batch sort.
---
GitHub user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r136829772

--- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala ---
@@ -639,6 +639,23 @@ case class LoadTable(
     val carbonProperty: CarbonProperties = CarbonProperties.getInstance()
     carbonProperty.addProperty("zookeeper.enable.lock", "false")
     val optionsFinal = getFinalOptions(carbonProperty)
+    val tableProperties = relation.tableMeta.carbonTable.getTableInfo
+      .getFactTable.getTableProperties
+
+    optionsFinal.put("sort_scope", tableProperties.getOrDefault("sort_scope",
+      carbonProperty.getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_SORT_SCOPE,
+        carbonProperty.getProperty(CarbonCommonConstants.LOAD_SORT_SCOPE,
+          CarbonCommonConstants.LOAD_SORT_SCOPE_DEFAULT))))
+
+    optionsFinal.put("batch_sort_size_inmb", tableProperties.getOrDefault("batch_sort_size_inmb",
+      carbonProperty.getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_BATCH_SORT_SIZE_INMB,
+        carbonProperty.getProperty(CarbonCommonConstants.LOAD_BATCH_SORT_SIZE_INMB,
+          CarbonCommonConstants.LOAD_BATCH_SORT_SIZE_INMB_DEFAULT))))
+
+    optionsFinal.put("global_sort_partitions", tableProperties.getOrDefault("global_sort_partitions",
--- End diff --

This is needed only if using global sort.
---
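The diff being discussed resolves each load option through a fallback chain: a table property set at CREATE TABLE time wins, then the session-level carbon property, then the system-level property, and finally a hard-coded default. A minimal sketch of that precedence, using plain maps as stand-ins for the table properties and `CarbonProperties` (all names here are illustrative, not the project's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the fallback chain in the diff above:
// table property > session carbon property > system carbon property > default.
public class OptionResolver {
    static final String LOAD_SORT_SCOPE_DEFAULT = "LOCAL_SORT";

    // Stand-ins for the table properties map and CarbonProperties lookups.
    private final Map<String, String> tableProperties;
    private final Map<String, String> carbonProperties;

    public OptionResolver(Map<String, String> tableProps, Map<String, String> carbonProps) {
        this.tableProperties = tableProps;
        this.carbonProperties = carbonProps;
    }

    // Mirrors tableProperties.getOrDefault(key,
    //   carbonProperty.getProperty(sessionKey,
    //     carbonProperty.getProperty(systemKey, default))) from the diff.
    public String resolveSortScope(String sessionKey, String systemKey) {
        String fromSystem = carbonProperties.getOrDefault(systemKey, LOAD_SORT_SCOPE_DEFAULT);
        String fromSession = carbonProperties.getOrDefault(sessionKey, fromSystem);
        return tableProperties.getOrDefault("sort_scope", fromSession);
    }
}
```

The nesting of `getOrDefault`/`getProperty` calls in the Scala diff encodes exactly this precedence order: the innermost lookup is consulted last.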
GitHub user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r136829860

--- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala ---
@@ -556,21 +556,21 @@ case class LoadTable(
       carbonProperty.getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_DATEFORMAT,
         CarbonLoadOptionConstants.CARBON_OPTIONS_DATEFORMAT_DEFAULT)))
-    optionsFinal.put("global_sort_partitions", options.getOrElse("global_sort_partitions",
-      carbonProperty
-        .getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_GLOBAL_SORT_PARTITIONS, null)))
+//    optionsFinal.put("global_sort_partitions", options.getOrElse("global_sort_partitions",
--- End diff --

Delete unused code instead of commenting it out.
---
GitHub user jackylk commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r136830112

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestGlobalSortDataLoad.scala ---
@@ -318,12 +329,12 @@ class TestGlobalSortDataLoad extends QueryTest with BeforeAndAfterEach with Befo
        | charField CHAR(5),
        | floatField FLOAT
        | )
-       | STORED BY 'org.apache.carbondata.format'
+       | STORED BY 'org.apache.carbondata.format' TBLPROPERTIES('SORT_SCOPE'='GLOBAL_SORT')
      """.stripMargin)
    sql(
      s"""
        | LOAD DATA LOCAL INPATH '$path' INTO TABLE carbon_globalsort_difftypes
-       | OPTIONS('SORT_SCOPE'='GLOBAL_SORT',
+       | OPTIONS(
        | 'FILEHEADER'='shortField,intField,bigintField,doubleField,stringField,timestampField,decimalField,dateField,charField,floatField')
      """.stripMargin)
--- End diff --

Add a new testcase that uses SORT_SCOPE in the LOAD DATA statement; it should fail.
---
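The requested testcase boils down to rejecting a load option that has become table-level only. A hypothetical sketch of that validation (the class, set contents, and error message are illustrative, not CarbonData's actual code):

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the requested check: once SORT_SCOPE is a table-level
// property, supplying it as a LOAD DATA option should be rejected outright.
public class LoadOptionValidator {
    // Options no longer accepted at load time under this proposal.
    private static final Set<String> TABLE_LEVEL_ONLY = Set.of("sort_scope");

    // Throws if the user supplies a table-level-only option in LOAD DATA.
    public static void validate(Map<String, String> loadOptions) {
        for (String key : loadOptions.keySet()) {
            if (TABLE_LEVEL_ONLY.contains(key.toLowerCase())) {
                throw new IllegalArgumentException(
                    "Option '" + key + "' is not supported in LOAD DATA; "
                    + "set it in TBLPROPERTIES at CREATE TABLE time");
            }
        }
    }
}
```

A corresponding test would assert that a LOAD with `'SORT_SCOPE'='GLOBAL_SORT'` in OPTIONS raises this error, while a LOAD without it succeeds.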
GitHub user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1321

SDV Build Fail, please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/517/
---
GitHub user gvramana commented on the issue:

https://github.com/apache/carbondata/pull/1321

Unification for simplicity is OK. For example, the SORT_SCOPE provided in CREATE TABLE becomes the default SORT_SCOPE for that table. However, we have scenarios where SORT_SCOPE needs to be overridden in the load command for data-load speed or streaming cases, so that compaction later brings the scope back to the table-level default value.

Option 1: Remove the load-time SORT_SCOPE now, and bring it back later once it is fully supported.

Option 2: Keep the syntax in data load as an optional parameter, and for now validate it to allow only valid values. For example, a load-level BATCH_SORT is valid when the table level is BATCH_SORT or LOCAL_SORT, since current compaction can support this. In all other cases the load-level SORT_SCOPE should be the same as the table-level SORT_SCOPE.
---
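The validation rule in Option 2 can be sketched as a small compatibility check (hypothetical helper name, not part of the codebase): a load-level BATCH_SORT override is accepted when the table-level scope is BATCH_SORT or LOCAL_SORT, because compaction can restore the table default; any other load-level scope must match the table-level scope exactly.

```java
// Sketch of the compatibility rule described in Option 2 above
// (hypothetical class and method names, for illustration only).
public class SortScopeCompatibility {
    public static boolean isValidOverride(String tableScope, String loadScope) {
        // Same scope at both levels is always fine.
        if (loadScope.equalsIgnoreCase(tableScope)) {
            return true;
        }
        // BATCH_SORT override is allowed only when compaction can bring the
        // data back to the table default (BATCH_SORT or LOCAL_SORT tables).
        return loadScope.equalsIgnoreCase("BATCH_SORT")
            && (tableScope.equalsIgnoreCase("BATCH_SORT")
                || tableScope.equalsIgnoreCase("LOCAL_SORT"));
    }
}
```

Under this rule, for example, a GLOBAL_SORT table rejects a BATCH_SORT load, while a LOCAL_SORT table accepts one.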
GitHub user jackylk commented on the issue:

https://github.com/apache/carbondata/pull/1321

@gvramana I suggest we consider these two requirements separately:

1. Streaming ingestion into a carbon table.
2. Overriding the sort scope after the table is created.

For 1, I think we should probably have another table property in CREATE TABLE to indicate that the table is streaming-ingestable, and we should have a separate API instead of SQL to do the streaming ingest. In any case, I think it is better to design this after carbon 1.2, in the streaming-ingest branch.

For 2, I feel we should not allow this, at least for carbon 1.2, because it opens up a lot of complexity: not just loading but also compaction would need a lot of validation and configuration to work at different levels. We should allow this only once every combination scenario has been tested.

In one word, we should remove the SORT_SCOPE option from the LOAD DATA statement for carbondata 1.2. If it is required in the future, we can always add it back later.
---
GitHub user chenerlu commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r137441815

--- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala ---
@@ -639,6 +639,23 @@ case class LoadTable(
     val carbonProperty: CarbonProperties = CarbonProperties.getInstance()
     carbonProperty.addProperty("zookeeper.enable.lock", "false")
     val optionsFinal = getFinalOptions(carbonProperty)
+    val tableProperties = relation.tableMeta.carbonTable.getTableInfo
+      .getFactTable.getTableProperties
+
+    optionsFinal.put("sort_scope", tableProperties.getOrDefault("sort_scope",
+      carbonProperty.getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_SORT_SCOPE,
+        carbonProperty.getProperty(CarbonCommonConstants.LOAD_SORT_SCOPE,
+          CarbonCommonConstants.LOAD_SORT_SCOPE_DEFAULT))))
+
+    optionsFinal.put("batch_sort_size_inmb", tableProperties.getOrDefault("batch_sort_size_inmb",
+      carbonProperty.getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_BATCH_SORT_SIZE_INMB,
+        carbonProperty.getProperty(CarbonCommonConstants.LOAD_BATCH_SORT_SIZE_INMB,
+          CarbonCommonConstants.LOAD_BATCH_SORT_SIZE_INMB_DEFAULT))))
+
+    optionsFinal.put("global_sort_partitions", tableProperties.getOrDefault("global_sort_partitions",
--- End diff --

Same as batch sort size, I think.
---
GitHub user chenerlu commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r137441759

--- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala ---
@@ -639,6 +639,23 @@ case class LoadTable(
     val carbonProperty: CarbonProperties = CarbonProperties.getInstance()
     carbonProperty.addProperty("zookeeper.enable.lock", "false")
     val optionsFinal = getFinalOptions(carbonProperty)
+    val tableProperties = relation.tableMeta.carbonTable.getTableInfo
+      .getFactTable.getTableProperties
+
+    optionsFinal.put("sort_scope", tableProperties.getOrDefault("sort_scope",
+      carbonProperty.getProperty(CarbonLoadOptionConstants.CARBON_OPTIONS_SORT_SCOPE,
+        carbonProperty.getProperty(CarbonCommonConstants.LOAD_SORT_SCOPE,
+          CarbonCommonConstants.LOAD_SORT_SCOPE_DEFAULT))))
+
+    optionsFinal.put("batch_sort_size_inmb", tableProperties.getOrDefault("batch_sort_size_inmb",
--- End diff --

Yes, this is only needed for batch sort, but I think if users specify this parameter with global sort, it is better to ignore it.
---
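The "ignore it" behavior chenerlu describes can be sketched as a pruning step over the resolved options: scope-specific keys are kept only when the resolved sort scope actually uses them. The class name and the pruning step itself are hypothetical illustrations of the suggestion, not code from the PR:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: drop scope-specific options that the resolved
// sort scope does not use, instead of failing the load.
public class ScopeSpecificOptions {
    public static Map<String, String> prune(Map<String, String> optionsFinal) {
        Map<String, String> pruned = new HashMap<>(optionsFinal);
        String scope = pruned.getOrDefault("sort_scope", "LOCAL_SORT");
        if (!scope.equalsIgnoreCase("BATCH_SORT")) {
            pruned.remove("batch_sort_size_inmb"); // only meaningful for batch sort
        }
        if (!scope.equalsIgnoreCase("GLOBAL_SORT")) {
            pruned.remove("global_sort_partitions"); // only meaningful for global sort
        }
        return pruned;
    }
}
```

This keeps a stray `batch_sort_size_inmb` on a GLOBAL_SORT table from having any effect, without turning it into a user-facing error.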
GitHub user chenerlu commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1321#discussion_r137442307

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/dataload/TestGlobalSortDataLoad.scala ---
@@ -318,12 +329,12 @@ class TestGlobalSortDataLoad extends QueryTest with BeforeAndAfterEach with Befo
        | charField CHAR(5),
        | floatField FLOAT
        | )
-       | STORED BY 'org.apache.carbondata.format'
+       | STORED BY 'org.apache.carbondata.format' TBLPROPERTIES('SORT_SCOPE'='GLOBAL_SORT')
      """.stripMargin)
    sql(
      s"""
        | LOAD DATA LOCAL INPATH '$path' INTO TABLE carbon_globalsort_difftypes
-       | OPTIONS('SORT_SCOPE'='GLOBAL_SORT',
+       | OPTIONS(
        | 'FILEHEADER'='shortField,intField,bigintField,doubleField,stringField,timestampField,decimalField,dateField,charField,floatField')
      """.stripMargin)
--- End diff --

OK.
---
GitHub user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1321

SDV Build Fail, please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/664/
---
GitHub user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1321

SDV Build Fail, please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/675/
---
GitHub user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1321

SDV Build Fail, please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/676/
---
GitHub user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1321

SDV Build Fail, please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/678/
---
GitHub user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1321

SDV Build Fail, please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/682/
---
GitHub user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1321

SDV Build Fail, please check CI: http://144.76.159.231:8080/job/ApacheSDVTests/689/
---
GitHub user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1321

Build Success with Spark 2.1.0, please check CI: http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/12/
---
GitHub user chenerlu commented on the issue:

https://github.com/apache/carbondata/pull/1321

Please retest it.
---