Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] [carbondata] Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table

Classic

List

54 messages Options

Options

123

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768628

##########
File path: integration/flink/src/test/scala/org/apache/carbon/flink/TestCarbonWriter.scala
##########
@@ -20,20 +20,23 @@ package org.apache.carbon.flink
import java.util.Properties

import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.{CarbonFile, CarbonFileFilter}
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
-import org.apache.spark.sql.Row
+import org.apache.spark.sql.{CarbonEnv, Row}
import org.apache.spark.sql.test.util.QueryTest
-
import org.apache.carbondata.core.datastore.impl.FileFactory
import org.apache.carbondata.core.util.CarbonProperties
import org.apache.carbondata.core.util.path.CarbonTablePath
+import org.apache.spark.sql.execution.exchange.Exchange

class TestCarbonWriter extends QueryTest {

val tableName = "test_flink"
+ val tableName2 = "insert_bucket_table"

Review comment:
```suggestion
val bucketTableName = "insert_bucket_table"
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769585

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
##########
@@ -584,36 +591,53 @@ class CarbonMergerRDD[K, V](
logInfo("no.of.nodes where data present=" + nodeBlockMap.size())
defaultParallelism = sparkContext.defaultParallelism

- // Create Spark Partition for each task and assign blocks
- nodeBlockMap.asScala.foreach { case (nodeName, splitList) =>
- val taskSplitList = new java.util.ArrayList[NodeInfo](0)
- nodeTaskBlocksMap.put(nodeName, taskSplitList)
- var blockletCount = 0
- splitList.asScala.foreach { splitInfo =>
- val splitsPerNode = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo]
- blockletCount = blockletCount + splitsPerNode.getCarbonInputSplitList.size()
- taskSplitList.add(
- NodeInfo(splitsPerNode.getTaskId, splitsPerNode.getCarbonInputSplitList.size()))
-
- if (blockletCount != 0) {
- val taskInfo = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo]
- val multiBlockSplit = if (null == rangeColumn || singleRange) {
- new CarbonMultiBlockSplit(
- taskInfo.getCarbonInputSplitList,
- Array(nodeName))
- } else {
- var splitListForRange = new util.ArrayList[CarbonInputSplit]()
- new CarbonMultiBlockSplit(
- splitListForRange,
- Array(nodeName))
+ if (bucketInfo != null) {

Review comment:
extract these new logic into a private function, so not to make internalCompute too big

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
##########
@@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V](
var singleRange = false
var expressionMapForRangeCol: util.Map[Integer, Expression] = null
var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null
+ val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo

Review comment:
Seems executor side does not need this variable, so move it into internalCompute to make it as a local variable

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386770338

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala
##########
@@ -96,7 +96,7 @@ class CarbonScanRDD[T: ClassTag](

private var directFill = false

- private val bucketedTable = tableInfo.getFactTable.getBucketingInfo
+ private val bucketInfo = tableInfo.getFactTable.getBucketingInfo

Review comment:
Does executor side need this? Can we make it as a local variable in internalCompute

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386770338

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala
##########
@@ -96,7 +96,7 @@ class CarbonScanRDD[T: ClassTag](

private var directFill = false

- private val bucketedTable = tableInfo.getFactTable.getBucketingInfo
+ private val bucketInfo = tableInfo.getFactTable.getBucketingInfo

Review comment:
Does executor side need this? Can we make it as a local variable in internalCompute

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
##########
@@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V](
var singleRange = false
var expressionMapForRangeCol: util.Map[Integer, Expression] = null
var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null
+ val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo

Review comment:
Seems executor side does not need this variable, so move it into internalCompute to make it as a local variable

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386771262

##########
File path: processing/pom.xml
##########
@@ -45,6 +45,11 @@
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
+ <dependency>
+ <groupId>org.apache.spark</groupId>

Review comment:
Should try to avoid depending on spark in core modules including: carbondata-core, carbondata-hadoop, carbondata-processing
Can we avoid adding this?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386785923

##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
*/
public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";

+ /**
+ * Hash method of bucket table
+ */
+ public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+ public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+ public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+ public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+ /**
+ * bucket properties
+ */
+ public static final String BUCKET_COLUMNS = "bucketcolumns";

Review comment:
@jackylk ok, will change the property name, support both property and syntax from hive already.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386817746

##########
File path: processing/pom.xml
##########
@@ -45,6 +45,11 @@
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
+ <dependency>
+ <groupId>org.apache.spark</groupId>

Review comment:
@jackylk if want to keep correct join result with parquet bucket tables, need to use same methods to hash the data of each datatype, so the code is needed.
1. copy the code from spark, but there are about 2,000 lines and if we copy the code, once spark change them we need to change together, its not a good choice, more details pls check the conversations above.
2. depend on spark-unsafe jar, we just depend 1 jar of spark and the changes of diff spark version don't have effect on us since we use it by version control in pom.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593790391

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/570/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593813133

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2274/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594404737

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/600/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594409277

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2307/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] asfgit closed pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

asfgit closed pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

123