Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637

### Why is this PR needed?
Support bucket tables consistent with Spark Parquet, to improve join performance by avoiding a shuffle on the bucket column. Also fix bugs.

### What changes were proposed in this PR?
Support bucket tables consistent with Spark Parquet, to improve join performance by avoiding a shuffle on the bucket column. Also fix bugs.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- Yes

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [hidden email]

With regards,
Apache Git Services
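To illustrate why bucketing can avoid a shuffle, here is a minimal Java sketch. It is not code from this PR: the class `BucketJoinSketch`, its methods, and the stand-in hash (`Integer.hashCode`) are all hypothetical. The idea it shows is that when both sides of a join are bucketed on the join key with the same function and the same bucket count, equal keys always land in the same bucket number, so bucket i of one table only needs to be joined with bucket i of the other.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: joining two co-bucketed inputs bucket by bucket. */
final class BucketJoinSketch {

    /** Non-negative modulo, so negative hashes still map into [0, n). */
    static int pmod(int value, int n) {
        int r = value % n;
        return r < 0 ? r + n : r;
    }

    /** Stand-in hash (an assumption for this sketch, not the PR's hash). */
    static int bucketOf(int key, int numBuckets) {
        return pmod(Integer.hashCode(key), numBuckets);
    }

    /** Splits keys into numBuckets lists according to bucketOf. */
    static List<List<Integer>> bucketize(int[] keys, int numBuckets) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int b = 0; b < numBuckets; b++) {
            buckets.add(new ArrayList<>());
        }
        for (int key : keys) {
            buckets.get(bucketOf(key, numBuckets)).add(key);
        }
        return buckets;
    }

    /** Counts equi-join matches, comparing only buckets with the same id. */
    static int bucketedJoinCount(int[] left, int[] right, int numBuckets) {
        List<List<Integer>> l = bucketize(left, numBuckets);
        List<List<Integer>> r = bucketize(right, numBuckets);
        int matches = 0;
        for (int b = 0; b < numBuckets; b++) {
            for (int lk : l.get(b)) {
                for (int rk : r.get(b)) {
                    if (lk == rk) {
                        matches++;
                    }
                }
            }
        }
        return matches;
    }

    /** Reference result: compares every left key with every right key. */
    static int naiveJoinCount(int[] left, int[] right) {
        int matches = 0;
        for (int lk : left) {
            for (int rk : right) {
                if (lk == rk) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] left = {1, 2, 3, -4, 5, 5};
        int[] right = {5, -4, 7, 2, 2};
        System.out.println(bucketedJoinCount(left, right, 4));
        System.out.println(naiveJoinCount(left, right));
    }
}
```

The bucket-wise join only gives the same answer as the full comparison if both sides use exactly the same hash and bucket count, which is why a bucket table written by one format has to hash the same way as the table it is joined with.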
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590245487
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/427/
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590275188
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2127/
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590296415
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/432/
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590333619
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2132/
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590640601
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/447/
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590662074
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/449/
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590679747
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2149/
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590739256
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/456/
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590753572
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2157/
Zhangshunyu commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590775160
retest this please
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590784776
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/459/
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590821550
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2160/
Indhumathi27 commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r384289659

##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/BucketMurmur3HashPartitionerImpl.java
##########
@@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.processing.loading.partition.impl;
+
+import java.util.List;
+
+import org.apache.carbondata.common.annotations.InterfaceAudience;
+import org.apache.carbondata.core.datastore.row.CarbonRow;
+import org.apache.carbondata.core.metadata.datatype.DataType;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
+import org.apache.carbondata.core.unsafe.hash.Murmur3_x86_32;
+import org.apache.carbondata.core.unsafe.types.UTF8String;
+import org.apache.carbondata.processing.loading.partition.Partitioner;
+
+/**
+ * Bucket Hash partitioner implementation using Murmur3_x86_32, it keep the same hash value as
+ * spark for given input.
+ */
+@InterfaceAudience.Internal
+public class BucketMurmur3HashPartitionerImpl implements Partitioner<CarbonRow> {
+
+  private int numberOfBuckets;
+
+  private Hash[] hashes;
+
+  public BucketMurmur3HashPartitionerImpl(List<Integer> indexes, List<ColumnSchema> columnSchemas,
+      int numberOfBuckets) {
+    this.numberOfBuckets = numberOfBuckets;
+    hashes = new Hash[indexes.size()];
+    for (int i = 0; i < indexes.size(); i++) {
+      DataType dataType = columnSchemas.get(i).getDataType();
+      if (dataType == DataTypes.LONG || dataType == DataTypes.DOUBLE) {
+        hashes[i] = new LongHash(indexes.get(i));
+      } else if (dataType == DataTypes.SHORT || dataType == DataTypes.INT ||
+          dataType == DataTypes.FLOAT || dataType == DataTypes.BOOLEAN) {
+        hashes[i] = new IntegralHash(indexes.get(i));
+      } else if (DataTypes.isDecimal(dataType)) {
+        hashes[i] = new DecimalHash(indexes.get(i));
+      } else if (dataType == DataTypes.TIMESTAMP) {

Review comment:
What about Hash for Date Type?
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r384909642

##########
File path: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/BucketMurmur3HashPartitionerImpl.java
##########
+      } else if (dataType == DataTypes.TIMESTAMP) {

Review comment:
@Indhumathi27 If we use a hash for that data type, the hash value will differ from Spark's, and the join result will mismatch with Parquet etc.
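The requirement that hash values match Spark's can be sketched as follows. This is a hedged illustration, not the PR's `BucketMurmur3HashPartitionerImpl`: the class `SparkStyleBucketSketch` and its method names are hypothetical, the mixing steps are the standard 32-bit Murmur3 ones, and the seed of 42 and the non-negative modulo for the bucket id are assumptions about what Spark does for hash partitioning. The point is that the bucket id depends on both the hash function and the physical representation that is hashed (a 4-byte int versus an 8-byte long), so a column hashed under a different representation will generally land in a different bucket than the one Spark expects.

```java
/** Illustrative only: Murmur3 x86_32 over int and long values, plus a bucket id. */
final class SparkStyleBucketSketch {

    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    /** Assumed seed; treat it as a parameter if your engine uses another one. */
    static final int SEED = 42;

    private static int mixK1(int k1) {
        k1 *= C1;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= C2;
        return k1;
    }

    private static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    /** Final avalanche step; length is the number of bytes hashed. */
    private static int fmix(int h1, int length) {
        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    /** Hashes a 4-byte value. */
    static int hashInt(int input, int seed) {
        return fmix(mixH1(seed, mixK1(input)), 4);
    }

    /** Hashes an 8-byte value as two 4-byte words, low word first. */
    static int hashLong(long input, int seed) {
        int low = (int) input;
        int high = (int) (input >>> 32);
        int h1 = mixH1(seed, mixK1(low));
        h1 = mixH1(h1, mixK1(high));
        return fmix(h1, 8);
    }

    /** Maps a possibly negative hash into [0, numBuckets). */
    static int bucketId(int hash, int numBuckets) {
        int r = hash % numBuckets;
        return r < 0 ? r + numBuckets : r;
    }

    public static void main(String[] args) {
        int numBuckets = 8;
        int asInt = bucketId(hashInt(18321, SEED), numBuckets);
        int asLong = bucketId(hashLong(18321L, SEED), numBuckets);
        // The same logical number, hashed under two representations.
        System.out.println("bucket as int:  " + asInt);
        System.out.println("bucket as long: " + asLong);
    }
}
```

Under these assumptions, a writer that hashes a column with a different width or a different algorithm than the reader expects would place rows in buckets the reader does not look in, which is one way a bucketed join against a Parquet table could return wrong results.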
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591780044
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/506/
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591801851
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2205/
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591972424
Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/512/
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592452095
@Zhangshunyu Bucketing is already supported in Carbon. I wonder why all this code is being added again to support it. If we are facing any issues, please first add the test cases that are not working, or raise a JIRA.
Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295
@ravipesala Please check all the new test cases added in TableBucketingTestCase and the comment I added in the PR description.
1. All data is stored into 1 file.
2. A join with Parquet returns a wrong result.
3. After compaction, data is stored into the file of bucket id 0.
4. The new insert flow does not work.
5. For the others, please check the test cases added.