Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295 @ravipesala pls check all the new testcases added in TableBucketingTestCase and the comment i added in the pr desc. we have this feature but not work fine as expected. 1. all data stored into 1 file 2. join with parquet return wrong result 3. after compaction it will store into file of bucket id 0 4. new insert flow not work in 5. the others pls check testcases added ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295 @ravipesala pls check all the new testcases added in TableBucketingTestCase and the comment i added in the pr desc. we have this feature but not work fine as expected. 1. all data stored into 1 file, not clustered in current code. 2. join with parquet return wrong result, even carbon tables themselves the string value use diff hashcode, the join result mismatch. we should use hash method same as spark and keep consistent value for same input. 3. after compaction it will store into file of bucket id 0. 4. new insert flow not work for bucket table. 5. the others pls check testcases added ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592490434 Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/530/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592499351 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2230/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592544098 Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/533/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592555376 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2233/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592917163 Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/539/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592918302 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2238/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592957640 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/543/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592966050 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2243/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592972205 @Zhangshunyu It was a supported feature earlier but it is bad that code got removed some time back. Anyway, spark changed the hashing technique on creating buckets so we cannot rely on our own hashing anymore. I see a lot of code got copied spark to just get the hashing. it is not recommended to do so as in the future if they change it will again break. Even they follow industry-standard murmur hash to do the hash. So please use the guava library and do the murmur hashing. Please don't copy the code unnecessarily from the spark. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592974580 @Zhangshunyu other way is to let the spark do the bucketing like how the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not much changes needed in that case. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593037009 > @Zhangshunyu other way is to let the spark do the bucketing like how the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not much changes needed in that case. @ravipesala is guava murmur hash the same as spark using? > @Zhangshunyu It was a supported feature earlier but it is bad that code got removed some time back. Anyway, spark changed the hashing technique on creating buckets so we cannot rely on our own hashing anymore. > I see a lot of code got copied spark to just get the hashing. it is not recommended to do so as in the future if they change it will again break. Even they follow industry-standard murmur hash to do the hash. So please use the guava library and do the murmur hashing. Please don't copy the code unnecessarily from the spark. spark using guava hash but not all the same like guava's impl, as for the changes in future of spark, if we want to keep same hash code as spark, maybe we can depend on spark-unsafe jar directly base on spark-version just like carbon depend on diff spark version. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593037009 > @Zhangshunyu other way is to let the spark do the bucketing like how the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not much changes needed in that case. @ravipesala is guava murmur hash the same as spark using? > @Zhangshunyu It was a supported feature earlier but it is bad that code got removed some time back. Anyway, spark changed the hashing technique on creating buckets so we cannot rely on our own hashing anymore. > I see a lot of code got copied spark to just get the hashing. it is not recommended to do so as in the future if they change it will again break. Even they follow industry-standard murmur hash to do the hash. So please use the guava library and do the murmur hashing. Please don't copy the code unnecessarily from the spark. @ravipesala spark using guava hash but not all the same like guava's impl, as for the changes in future of spark, if we want to keep same hash code as spark, maybe we can depend on spark-unsafe jar directly base on spark-version just like carbon depend on diff spark version. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593038010 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/545/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593050200 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2245/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593054571 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/547/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593061151 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2247/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386767939 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() { */ public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false"; + /** + * Hash method of bucket table + */ + public static final String BUCKET_HASH_METHOD = "bucket_hash_method"; + public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression"; + public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression"; + public static final String BUCKET_HASH_METHOD_NATIVE = "native"; + + /** + * bucket properties + */ + public static final String BUCKET_COLUMNS = "bucketcolumns"; Review comment: Is these for table properties? suggest to change to "bucket_columns" ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768101 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -2379,4 +2379,18 @@ private CarbonCommonConstants() { */ public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false"; + /** + * Hash method of bucket table + */ + public static final String BUCKET_HASH_METHOD = "bucket_hash_method"; + public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression"; + public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression"; + public static final String BUCKET_HASH_METHOD_NATIVE = "native"; + + /** + * bucket properties + */ + public static final String BUCKET_COLUMNS = "bucketcolumns"; Review comment: Can we follow the bucket table syntax from hive? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
Free forum by Nabble | Edit this page |