jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768628 ########## File path: integration/flink/src/test/scala/org/apache/carbon/flink/TestCarbonWriter.scala ########## @@ -20,20 +20,23 @@ package org.apache.carbon.flink import java.util.Properties import org.apache.carbondata.core.constants.CarbonCommonConstants +import org.apache.carbondata.core.datastore.filesystem.{CarbonFile, CarbonFileFilter} import org.apache.flink.api.common.restartstrategy.RestartStrategies import org.apache.flink.core.fs.Path import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink -import org.apache.spark.sql.Row +import org.apache.spark.sql.{CarbonEnv, Row} import org.apache.spark.sql.test.util.QueryTest - import org.apache.carbondata.core.datastore.impl.FileFactory import org.apache.carbondata.core.util.CarbonProperties import org.apache.carbondata.core.util.path.CarbonTablePath +import org.apache.spark.sql.execution.exchange.Exchange class TestCarbonWriter extends QueryTest { val tableName = "test_flink" + val tableName2 = "insert_bucket_table" Review comment: ```suggestion val bucketTableName = "insert_bucket_table" ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769585 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala ########## @@ -584,36 +591,53 @@ class CarbonMergerRDD[K, V]( logInfo("no.of.nodes where data present=" + nodeBlockMap.size()) defaultParallelism = sparkContext.defaultParallelism - // Create Spark Partition for each task and assign blocks - nodeBlockMap.asScala.foreach { case (nodeName, splitList) => - val taskSplitList = new java.util.ArrayList[NodeInfo](0) - nodeTaskBlocksMap.put(nodeName, taskSplitList) - var blockletCount = 0 - splitList.asScala.foreach { splitInfo => - val splitsPerNode = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo] - blockletCount = blockletCount + splitsPerNode.getCarbonInputSplitList.size() - taskSplitList.add( - NodeInfo(splitsPerNode.getTaskId, splitsPerNode.getCarbonInputSplitList.size())) - - if (blockletCount != 0) { - val taskInfo = splitInfo.asInstanceOf[CarbonInputSplitTaskInfo] - val multiBlockSplit = if (null == rangeColumn || singleRange) { - new CarbonMultiBlockSplit( - taskInfo.getCarbonInputSplitList, - Array(nodeName)) - } else { - var splitListForRange = new util.ArrayList[CarbonInputSplit]() - new CarbonMultiBlockSplit( - splitListForRange, - Array(nodeName)) + if (bucketInfo != null) { Review comment: extract these new logic into a private function, so not to make internalCompute too big ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala ########## @@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V]( var singleRange = false var expressionMapForRangeCol: util.Map[Integer, Expression] = null var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null + val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo Review comment: Seems executor side does not need this variable, so move it into internalCompute to make it as a local variable ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386770338 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala ########## @@ -96,7 +96,7 @@ class CarbonScanRDD[T: ClassTag]( private var directFill = false - private val bucketedTable = tableInfo.getFactTable.getBucketingInfo + private val bucketInfo = tableInfo.getFactTable.getBucketingInfo Review comment: Does executor side need this? Can we make it as a local variable in internalCompute ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386770338 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonScanRDD.scala ########## @@ -96,7 +96,7 @@ class CarbonScanRDD[T: ClassTag]( private var directFill = false - private val bucketedTable = tableInfo.getFactTable.getBucketingInfo + private val bucketInfo = tableInfo.getFactTable.getBucketingInfo Review comment: Does executor side need this? Can we make it as a local variable in internalCompute ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386769957 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala ########## @@ -91,6 +91,7 @@ class CarbonMergerRDD[K, V]( var singleRange = false var expressionMapForRangeCol: util.Map[Integer, Expression] = null var broadCastSplits: Broadcast[CarbonInputSplitWrapper] = null + val bucketInfo = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.getBucketingInfo Review comment: Seems executor side does not need this variable, so move it into internalCompute to make it as a local variable ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386771262 ########## File path: processing/pom.xml ########## @@ -45,6 +45,11 @@ <artifactId>spark-sql_${scala.binary.version}</artifactId> <version>${spark.version}</version> </dependency> + <dependency> + <groupId>org.apache.spark</groupId> Review comment: Should try to avoid depending on spark in core modules including: carbondata-core, carbondata-hadoop, carbondata-processing Can we avoid adding this? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386785923 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() { */ public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false"; + /** + * Hash method of bucket table + */ + public static final String BUCKET_HASH_METHOD = "bucket_hash_method"; + public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression"; + public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression"; + public static final String BUCKET_HASH_METHOD_NATIVE = "native"; + + /** + * bucket properties + */ + public static final String BUCKET_COLUMNS = "bucketcolumns"; Review comment: @jackylk ok, will change the property name, support both property and syntax from hive already. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386817746 ########## File path: processing/pom.xml ########## @@ -45,6 +45,11 @@ <artifactId>spark-sql_${scala.binary.version}</artifactId> <version>${spark.version}</version> </dependency> + <dependency> + <groupId>org.apache.spark</groupId> Review comment: @jackylk if want to keep correct join result with parquet bucket tables, need to use same methods to hash the data of each datatype, so the code is needed. 1. copy the code from spark, but there are about 2,000 lines and if we copy the code, once spark change them we need to change together, its not a good choice, more details pls check the conversations above. 2. depend on spark-unsafe jar, we just depend 1 jar of spark and the changes of diff spark version don't have effect on us since we use it by version control in pom. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593790391 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/570/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593813133 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2274/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594404737 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/600/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-594409277 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2307/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
asfgit closed pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
Free forum by Nabble | Edit this page |