[GitHub] [carbondata] Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table

classic Classic list List threaded Threaded
54 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768628
 
 

 ##########
 File path: integration/flink/src/test/scala/org/apache/carbon/flink/TestCarbonWriter.scala
 ##########
 @@ -20,20 +20,23 @@ package org.apache.carbon.flink
 import java.util.Properties
 
 import org.apache.carbondata.core.constants.CarbonCommonConstants
+import org.apache.carbondata.core.datastore.filesystem.{CarbonFile, CarbonFileFilter}
 import org.apache.flink.api.common.restartstrategy.RestartStrategies
 import org.apache.flink.core.fs.Path
 import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
 import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
-import org.apache.spark.sql.Row
+import org.apache.spark.sql.{CarbonEnv, Row}
 import org.apache.spark.sql.test.util.QueryTest
-
 import org.apache.carbondata.core.datastore.impl.FileFactory
 import org.apache.carbondata.core.util.CarbonProperties
 import org.apache.carbondata.core.util.path.CarbonTablePath
+import org.apache.spark.sql.execution.exchange.Exchange
 
 class TestCarbonWriter extends QueryTest {
 
   val tableName = "test_flink"
+  val tableName2 = "insert_bucket_table"
 
 Review comment:
   ```suggestion
     val bucketTableName = "insert_bucket_table"
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769585
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
 ##########
 @@ -584,36 +591,53 @@ class CarbonMergerRDD[K, V](
     logInfo("no.of.nodes where data present=" + nodeBlockMap.size())
     defaultParallelism = sparkContext.defaultParallelism
 
-    // Create Spark Partition for each task and assign blocks
-    nodeBlockMap.asScala.foreach { case (nodeName, splitList) =>
-      val taskSplitList = new java.util.ArrayList[NodeInfo](0)
-      nodeTaskBlocksMap.put(nodeName, taskSplitList)
-      var blockletCount = 0
-      splitList.asScala.foreach { splitInfo =>
-        val splitsPerNode = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo]
-        blockletCount = blockletCount + splitsPerNode.getCarbonInputSplitList.size()
-        taskSplitList.add(
-          NodeInfo(splitsPerNode.getTaskId, splitsPerNode.getCarbonInputSplitList.size()))
-
-        if (blockletCount != 0) {
-          val taskInfo = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo]
-          val multiBlockSplit = if (null == rangeColumn || singleRange) {
-            new CarbonMultiBlockSplit(
-              taskInfo.getCarbonInputSplitList,
-              Array(nodeName))
-          } else {
-            var splitListForRange = new util.ArrayList[CarbonInputSplit]()
-            new CarbonMultiBlockSplit(
-              splitListForRange,
-              Array(nodeName))
+    if (bucketInfo != null) {
 
 Review comment:
   extract these new logic into a private function, so not to make internalCompute too big

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
 ##########
 @@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V](
   var singleRange = false
   var expressionMapForRangeCol: util.Map[Integer, Expression] = null
   var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null
+  val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo
 
 Review comment:
   Seems executor side does not need this variable, so move it into internalCompute to make it as a local variable

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386770338
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala
 ##########
 @@ -96,7 +96,7 @@ class CarbonScanRDD[T: ClassTag](
 
   private var directFill = false
 
-  private val bucketedTable = tableInfo.getFactTable.getBucketingInfo
+  private val bucketInfo = tableInfo.getFactTable.getBucketingInfo
 
 Review comment:
   Does executor side need this? Can we make it as a local variable in internalCompute

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386770338
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala
 ##########
 @@ -96,7 +96,7 @@ class CarbonScanRDD[T: ClassTag](
 
   private var directFill = false
 
-  private val bucketedTable = tableInfo.getFactTable.getBucketingInfo
+  private val bucketInfo = tableInfo.getFactTable.getBucketingInfo
 
 Review comment:
   Does executor side need this? Can we make it as a local variable in internalCompute

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957
 
 

 ##########
 File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
 ##########
 @@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V](
   var singleRange = false
   var expressionMapForRangeCol: util.Map[Integer, Expression] = null
   var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null
+  val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo
 
 Review comment:
   Seems executor side does not need this variable, so move it into internalCompute to make it as a local variable

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386771262
 
 

 ##########
 File path: processing/pom.xml
 ##########
 @@ -45,6 +45,11 @@
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
 
 Review comment:
   Should try to avoid depending on spark in core modules including: carbondata-core, carbondata-hadoop, carbondata-processing
   Can we avoid adding this?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386785923
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
 ##########
 @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";
 
+  /**
+   * Hash method of bucket table
+   */
+  public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+  public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+  /**
+   * bucket properties
+   */
+  public static final String BUCKET_COLUMNS = "bucketcolumns";
 
 Review comment:
   @jackylk ok, will change the property name, support both property and syntax from hive already.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386817746
 
 

 ##########
 File path: processing/pom.xml
 ##########
 @@ -45,6 +45,11 @@
       <artifactId>spark-sql_${scala.binary.version}</artifactId>
       <version>${spark.version}</version>
     </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
 
 Review comment:
   @jackylk if want to keep correct join result with parquet bucket tables, need to use same methods to hash the data of each datatype, so the code is needed.
   1. copy the code from spark, but there are about 2,000 lines and if we copy the code, once spark change them we need to change together, its not a good choice, more details pls check the conversations above.
   2. depend on spark-unsafe jar, we just depend 1 jar of spark and the changes of diff spark version don't have effect on us since we use it by version control in pom.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593790391
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/570/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593813133
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2274/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594404737
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/600/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594409277
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2307/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] asfgit closed pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
asfgit closed pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
123