[GitHub] [carbondata] Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table

[GitHub] [carbondata] Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table

GitBox
Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637
 
 
    ### Why is this PR needed?
    Support bucket tables consistent with Spark Parquet bucketing, to improve join performance by avoiding a shuffle on the bucket column. Also fix bugs.
   
    ### What changes were proposed in this PR?
    Support bucket tables consistent with Spark Parquet bucketing, to improve join performance by avoiding a shuffle on the bucket column. Also fix bugs.
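
For illustration only (this is not code from the PR): a minimal sketch of why matching bucketing lets a join skip the shuffle. If both tables assign rows to buckets with the same function and the same bucket count, equal keys always land in the same bucket id, so bucket i of one table only ever needs to be joined with bucket i of the other. The class name, the method names and the stand-in bucket function `bucketOf` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: neither CarbonData nor Spark code. */
final class BucketJoinSketch {

  /** Stand-in bucket function (an assumption); both sides must use the same one. */
  static int bucketOf(int key, int numBuckets) {
    return Math.floorMod(Integer.hashCode(key), numBuckets);
  }

  /** Splits keys into numBuckets lists, the way a bucketed writer splits rows into files. */
  static List<List<Integer>> bucketize(int[] keys, int numBuckets) {
    List<List<Integer>> buckets = new ArrayList<>();
    for (int b = 0; b < numBuckets; b++) {
      buckets.add(new ArrayList<>());
    }
    for (int key : keys) {
      buckets.get(bucketOf(key, numBuckets)).add(key);
    }
    return buckets;
  }

  /** Joins bucket i of the left side only with bucket i of the right side. */
  static int bucketWiseJoinCount(int[] left, int[] right, int numBuckets) {
    List<List<Integer>> l = bucketize(left, numBuckets);
    List<List<Integer>> r = bucketize(right, numBuckets);
    int matches = 0;
    for (int b = 0; b < numBuckets; b++) {
      // Count left keys in this bucket, then probe with the right keys.
      Map<Integer, Integer> counts = new HashMap<>();
      for (int key : l.get(b)) {
        counts.merge(key, 1, Integer::sum);
      }
      for (int key : r.get(b)) {
        matches += counts.getOrDefault(key, 0);
      }
    }
    return matches;
  }

  /** Reference result: nested-loop join over all rows, ignoring buckets. */
  static int fullJoinCount(int[] left, int[] right) {
    int matches = 0;
    for (int a : left) {
      for (int b : right) {
        if (a == b) {
          matches++;
        }
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    int[] left = {1, 2, 3, 4, 5, 5, -7};
    int[] right = {5, 3, 9, -7, -7};
    System.out.println(bucketWiseJoinCount(left, right, 4)); // 5
    System.out.println(fullJoinCount(left, right));          // 5
  }
}
```

The bucket-wise join finds every match without moving rows between buckets. That only holds while both sides agree on the bucket function, which is why hash compatibility with Spark matters for joins against Parquet tables.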
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - Yes
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

GitBox
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590245487
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/427/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590275188
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2127/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590296415
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/432/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590333619
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2132/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590640601
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/447/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590662074
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/449/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590679747
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2149/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590739256
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/456/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590753572
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2157/
   


[GitHub] [carbondata] Zhangshunyu commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

GitBox
In reply to this post by GitBox
Zhangshunyu commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590775160
 
 
   retest this please


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590784776
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/459/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [WIP][CARBONDATA-3721] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-590821550
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2160/
   


[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
Indhumathi27 commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r384289659
 
 

 ##########
 File path: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/BucketMurmur3HashPartitionerImpl.java
 ##########
 @@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.carbondata.processing.loading.partition.impl;
+
+import java.util.List;
+
+import org.apache.carbondata.common.annotations.InterfaceAudience;
+import org.apache.carbondata.core.datastore.row.CarbonRow;
+import org.apache.carbondata.core.metadata.datatype.DataType;
+import org.apache.carbondata.core.metadata.datatype.DataTypes;
+import org.apache.carbondata.core.metadata.schema.table.column.ColumnSchema;
+import org.apache.carbondata.core.unsafe.hash.Murmur3_x86_32;
+import org.apache.carbondata.core.unsafe.types.UTF8String;
+import org.apache.carbondata.processing.loading.partition.Partitioner;
+
+/**
+ * Bucket Hash partitioner implementation using Murmur3_x86_32, it keep the same hash value as
+ * spark for given input.
+ */
+@InterfaceAudience.Internal
+public class BucketMurmur3HashPartitionerImpl implements Partitioner<CarbonRow> {
+
+  private int numberOfBuckets;
+
+  private Hash[] hashes;
+
+  public BucketMurmur3HashPartitionerImpl(List<Integer> indexes, List<ColumnSchema> columnSchemas,
+                                          int numberOfBuckets) {
+    this.numberOfBuckets = numberOfBuckets;
+    hashes = new Hash[indexes.size()];
+    for (int i = 0; i < indexes.size(); i++) {
+      DataType dataType = columnSchemas.get(i).getDataType();
+      if (dataType == DataTypes.LONG || dataType == DataTypes.DOUBLE) {
+        hashes[i] = new LongHash(indexes.get(i));
+      } else if (dataType == DataTypes.SHORT || dataType == DataTypes.INT ||
+          dataType == DataTypes.FLOAT || dataType == DataTypes.BOOLEAN) {
+        hashes[i] = new IntegralHash(indexes.get(i));
+      } else if (DataTypes.isDecimal(dataType)) {
+        hashes[i] = new DecimalHash(indexes.get(i));
+      } else if (dataType == DataTypes.TIMESTAMP) {
 
 Review comment:
   What about a hash for the Date type?


[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
Zhangshunyu commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r384909642
 
 

 ##########
 File path: processing/src/main/java/org/apache/carbondata/processing/loading/partition/impl/BucketMurmur3HashPartitionerImpl.java
 ##########
 @@ -0,0 +1,181 @@
+      } else if (dataType == DataTypes.TIMESTAMP) {
 
 Review comment:
   @Indhumathi27 If we use a hash for that data type, the hash value will differ from Spark's, and join results will mismatch with Parquet, etc.
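
For readers following this exchange, a minimal sketch of what "the same hash value as Spark" involves. It is not the PR's implementation. It assumes Spark's usual behaviour: Murmur3_x86_32 with seed 42, a bucket id equal to the non-negative remainder of the hash by the bucket count, and a DATE value hashed as its int number of days since 1970-01-01. The class name and the helper `bucketIdForDate` are hypothetical.

```java
import java.time.LocalDate;

/** Illustrative sketch of Spark-style bucket ids; not the PR's code. */
final class SparkStyleBucketIdSketch {

  /** Seed assumed to match Spark's Murmur3 hash expression. */
  static final int SEED = 42;

  private static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    return k1;
  }

  private static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  private static int fmix(int h1, int length) {
    h1 ^= length;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;
    return h1;
  }

  /** Murmur3_x86_32 over the 4 bytes of an int. */
  static int hashInt(int input, int seed) {
    return fmix(mixH1(seed, mixK1(input)), 4);
  }

  /** Murmur3_x86_32 over the 8 bytes of a long, low word first. */
  static int hashLong(long input, int seed) {
    int low = (int) input;
    int high = (int) (input >>> 32);
    int h1 = mixH1(seed, mixK1(low));
    h1 = mixH1(h1, mixK1(high));
    return fmix(h1, 8);
  }

  /** Non-negative remainder, so negative hashes still map to a valid bucket. */
  static int bucketId(int hash, int numBuckets) {
    return Math.floorMod(hash, numBuckets);
  }

  /** Hypothetical helper: treats a date as its int day count, then hashes it as an int. */
  static int bucketIdForDate(LocalDate date, int numBuckets) {
    int days = Math.toIntExact(date.toEpochDay());
    return bucketId(hashInt(days, SEED), numBuckets);
  }

  public static void main(String[] args) {
    // Standard Murmur3 test vector: four zero bytes with seed 0.
    System.out.println(Integer.toHexString(hashInt(0, 0))); // 2362f9de
    System.out.println(bucketIdForDate(LocalDate.of(2020, 2, 24), 8));
    System.out.println(bucketId(hashLong(1582502400000000L, SEED), 8));
  }
}
```

The point of the sketch is that the value fed to the hash must have the same width and encoding on both sides: hashing a date through the long path, or with a different seed, would give bucket ids that no longer line up with the files Spark writes.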


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591780044
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/506/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591801851
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2205/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-591972424
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/512/
   


[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592452095
 
 
   @Zhangshunyu Bucketing is already supported in Carbon. I wonder why all this code is being added again to support it. If we are facing any issues, please first add the test cases that are not working, or raise a JIRA.


[GitHub] [carbondata] Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295
 
 
   @ravipesala Please check all the new test cases added in TableBucketingTestCase and the comment I added in the PR description.
   1. All data is stored into one file.
   2. A join with Parquet returns wrong results.
   3. After compaction, data is stored into the file for bucket id 0.
   4. The new insert flow does not work.
   5. For the others, please check the test cases added.
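
Items 1 and 3 describe the same symptom: every row ends up in a single bucket. A rough sketch of how a test could detect that is to count rows per bucket id and check that more than one bucket is populated. This is not taken from TableBucketingTestCase; the class and method names below are hypothetical.

```java
/** Illustrative check for "all rows in one bucket"; not code from the PR. */
final class BucketSkewCheckSketch {

  /** Counts how many rows were written for each bucket id. */
  static int[] rowsPerBucket(int[] bucketIds, int numBuckets) {
    int[] counts = new int[numBuckets];
    for (int id : bucketIds) {
      if (id < 0 || id >= numBuckets) {
        throw new IllegalArgumentException("bucket id out of range: " + id);
      }
      counts[id]++;
    }
    return counts;
  }

  /** Number of buckets that received at least one row. */
  static int nonEmptyBuckets(int[] counts) {
    int nonEmpty = 0;
    for (int c : counts) {
      if (c > 0) {
        nonEmpty++;
      }
    }
    return nonEmpty;
  }

  public static void main(String[] args) {
    int[] spread = {0, 1, 2, 3, 0, 1};    // rows spread over four buckets
    int[] collapsed = {0, 0, 0, 0, 0, 0}; // symptom from items 1 and 3
    System.out.println(nonEmptyBuckets(rowsPerBucket(spread, 4)));    // 4
    System.out.println(nonEmptyBuckets(rowsPerBucket(collapsed, 4))); // 1
  }
}
```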
