[GitHub] [carbondata] Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table

[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295
 
 
   @ravipesala please check all the new test cases added in TableBucketingTestCase and the comment I added in the PR description. We already have this feature, but it does not work as expected:
   1. All data is stored in one file.
   2. Joins with Parquet tables return wrong results.
   3. After compaction, the data is stored in the file for bucket id 0.
   4. The new insert flow does not work.
   5. For the other issues, please check the added test cases.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services

[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

GitBox
In reply to this post by GitBox
Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295
 
 
   @ravipesala please check all the new test cases added in TableBucketingTestCase and the comment I added in the PR description. We already have this feature, but it does not work as expected:
   1. All data is stored in one file; it is not clustered in the current code.
   2. Joins with Parquet tables return wrong results. Even between Carbon tables, string values get different hash codes, so the join results mismatch. We should use the same hash method as Spark and produce a consistent value for the same input.
   3. After compaction, the data is stored in the file for bucket id 0.
   4. The new insert flow does not work for bucket tables.
   5. For the other issues, please check the added test cases.
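
   Point 2 above depends on both sides of a join mapping the same key to the same bucket. A minimal sketch of that idea follows, assuming the bucket id is the non-negative remainder of a hash code divided by the bucket count. The class `BucketIdSketch`, its methods, and the two stand-in hash functions are hypothetical and are not taken from the CarbonData or Spark code.

```java
import java.util.function.ToIntFunction;

// Hypothetical illustration only; not CarbonData or Spark code.
final class BucketIdSketch {

    // Maps a hash code to a bucket in [0, numBuckets).
    // Math.floorMod keeps negative hash codes inside that range.
    static int bucketId(int hash, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        return Math.floorMod(hash, numBuckets);
    }

    // Hashes the key with the supplied function, then picks the bucket.
    static <K> int bucketFor(K key, ToIntFunction<K> hashFn, int numBuckets) {
        return bucketId(hashFn.applyAsInt(key), numBuckets);
    }

    public static void main(String[] args) {
        // Two stand-in hash functions that represent two engines
        // which do not agree on how a string is hashed (assumption).
        ToIntFunction<String> writerHash = String::hashCode;
        ToIntFunction<String> readerHash = s -> 31 * s.hashCode() + 7;
        int numBuckets = 4;
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(key
                + ": writer bucket " + bucketFor(key, writerHash, numBuckets)
                + ", reader bucket " + bucketFor(key, readerHash, numBuckets));
        }
    }
}
```

   If the writer and the reader disagree on the hash function, a bucket-to-bucket join can pair up files that do not hold the same keys, which is one way to get a mismatched join result.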


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592490434
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/530/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592499351
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2230/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592544098
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/533/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592555376
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2233/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592917163
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/539/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592918302
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2238/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592957640
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/543/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592966050
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2243/
   


[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592972205
 
 
   @Zhangshunyu This feature was supported earlier, but unfortunately the code was removed some time back. In any case, Spark changed its hashing technique for creating buckets, so we cannot rely on our own hashing anymore.
   I see that a lot of code was copied from Spark just to get the hashing. That is not recommended, because if Spark changes it in the future, it will break again. Spark follows the industry-standard murmur hash anyway, so please use the Guava library to do the murmur hashing. Please don't copy code from Spark unnecessarily.


[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592974580
 
 
   @Zhangshunyu Another way is to let Spark do the bucketing, similar to how the partitioner is implemented. In fact, we can add bucketing directly into the partition flow. Not many changes are needed in that case.


[GitHub] [carbondata] Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593037009
 
 
   > @Zhangshunyu Another way is to let Spark do the bucketing, similar to how the partitioner is implemented. In fact, we can add bucketing directly into the partition flow. Not many changes are needed in that case.
   
   @ravipesala Is Guava's murmur hash the same as the one Spark uses?
   
   > @Zhangshunyu This feature was supported earlier, but unfortunately the code was removed some time back. In any case, Spark changed its hashing technique for creating buckets, so we cannot rely on our own hashing anymore.
   > I see that a lot of code was copied from Spark just to get the hashing. That is not recommended, because if Spark changes it in the future, it will break again. Spark follows the industry-standard murmur hash anyway, so please use the Guava library to do the murmur hashing. Please don't copy code from Spark unnecessarily.
   
   Spark uses a Guava-like murmur hash, but its implementation is not exactly the same as Guava's. As for future changes in Spark: if we want to keep the same hash codes as Spark, maybe we can depend on the spark-unsafe jar directly, chosen by Spark version, just as Carbon already depends on different Spark versions.
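
   The difference mentioned above can be illustrated with a small, dependency-free sketch. It assumes, as far as can be told from Spark's `Murmur3_x86_32.hashUnsafeBytes`, that Spark mixes each trailing byte (those beyond the last full 4-byte word) as its own block, whereas standard Murmur3 x86_32, which Guava's `murmur3_32` follows, packs the trailing bytes into a single partial word. The class `Murmur3TailSketch` and its method names are hypothetical; the code calls neither Guava nor Spark. Seed 42 is used in `main` because that is the seed of Spark's `Murmur3Hash` expression.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; reimplements Murmur3 x86_32 by hand.
final class Murmur3TailSketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    private static int mixK1(int k1) {
        k1 *= C1;
        k1 = Integer.rotateLeft(k1, 15);
        return k1 * C2;
    }

    private static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    private static int fmix(int h1, int length) {
        h1 ^= length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Mixes every complete 4-byte little-endian word; shared by both variants.
    private static int hashAlignedWords(byte[] data, int alignedLength, int seed) {
        int h1 = seed;
        for (int i = 0; i < alignedLength; i += 4) {
            int word = (data[i] & 0xff)
                | ((data[i + 1] & 0xff) << 8)
                | ((data[i + 2] & 0xff) << 16)
                | ((data[i + 3] & 0xff) << 24);
            h1 = mixH1(h1, mixK1(word));
        }
        return h1;
    }

    // Standard Murmur3 x86_32: trailing bytes are packed into one partial word.
    static int standardMurmur3(byte[] data, int seed) {
        int aligned = data.length - data.length % 4;
        int h1 = hashAlignedWords(data, aligned, seed);
        int k1 = 0;
        for (int i = aligned, shift = 0; i < data.length; i++, shift += 8) {
            k1 ^= (data[i] & 0xff) << shift;
        }
        h1 ^= mixK1(k1);
        return fmix(h1, data.length);
    }

    // Assumed Spark-style variant: each trailing byte (sign-extended) is mixed as a full block.
    static int sparkStyleMurmur3(byte[] data, int seed) {
        int aligned = data.length - data.length % 4;
        int h1 = hashAlignedWords(data, aligned, seed);
        for (int i = aligned; i < data.length; i++) {
            h1 = mixH1(h1, mixK1(data[i]));
        }
        return fmix(h1, data.length);
    }

    // Non-negative bucket id for a hash code.
    static int bucketId(int hash, int numBuckets) {
        return Math.floorMod(hash, numBuckets);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abcdefgh", "abc"}) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(s + ": standard=" + standardMurmur3(bytes, 42)
                + ", sparkStyle=" + sparkStyleMurmur3(bytes, 42));
        }
    }
}
```

   For inputs whose length is a multiple of 4 bytes the two variants agree, because there are no trailing bytes. For other lengths they may differ, so a string column hashed with the standard algorithm is not guaranteed to land in the bucket Spark would pick.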



[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593038010
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/545/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593050200
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2245/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593054571
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/547/
   


[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593061151
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2247/
   


[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386767939
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
 ##########
 @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";
 
+  /**
+   * Hash method of bucket table
+   */
+  public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+  public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+  /**
+   * bucket properties
+   */
+  public static final String BUCKET_COLUMNS = "bucketcolumns";
 
 Review comment:
   Are these for table properties? I suggest changing this to "bucket_columns".


[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768101
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
 ##########
 @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
    */
   public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";
 
+  /**
+   * Hash method of bucket table
+   */
+  public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+  public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+  public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+  /**
+   * bucket properties
+   */
+  public static final String BUCKET_COLUMNS = "bucketcolumns";
 
 Review comment:
   Can we follow Hive's bucket table syntax?
