Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] [carbondata] Zhangshunyu opened a new pull request #3637: [WIP] Support Bucket Table

Classic

List

54 messages Options

Options

123

GitBox

[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295

@ravipesala pls check all the new testcases added in TableBucketingTestCase and the comment i added in the pr desc. we have this feature but not work fine as expected.
1. all data stored into 1 file
2. join with parquet return wrong result
3. after compaction it will store into file of bucket id 0
4. new insert flow not work in
5. the others pls check testcases added

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table

In reply to this post by GitBox

Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Support Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592454295

@ravipesala pls check all the new testcases added in TableBucketingTestCase and the comment i added in the pr desc. we have this feature but not work fine as expected.
1. all data stored into 1 file, not clustered in current code.
2. join with parquet return wrong result, even carbon tables themselves the string value use diff hashcode, the join result mismatch. we should use hash method same as spark and keep consistent value for same input.
3. after compaction it will store into file of bucket id 0.
4. new insert flow not work for bucket table.
5. the others pls check testcases added

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592490434

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/530/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592499351

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2230/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592544098

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/533/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592555376

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2233/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592917163

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/539/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592918302

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2238/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592957640

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/543/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592966050

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2243/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592972205

@Zhangshunyu It was a supported feature earlier but it is bad that code got removed some time back. Anyway, spark changed the hashing technique on creating buckets so we cannot rely on our own hashing anymore.
I see a lot of code got copied spark to just get the hashing. it is not recommended to do so as in the future if they change it will again break. Even they follow industry-standard murmur hash to do the hash. So please use the guava library and do the murmur hashing. Please don't copy the code unnecessarily from the spark.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

ravipesala commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-592974580

@Zhangshunyu other way is to let the spark do the bucketing like how the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not much changes needed in that case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

Zhangshunyu commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593037009

> @Zhangshunyu other way is to let the spark do the bucketing like how the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not much changes needed in that case.

@ravipesala is guava murmur hash the same as spark using?

> @Zhangshunyu It was a supported feature earlier but it is bad that code got removed some time back. Anyway, spark changed the hashing technique on creating buckets so we cannot rely on our own hashing anymore.
> I see a lot of code got copied spark to just get the hashing. it is not recommended to do so as in the future if they change it will again break. Even they follow industry-standard murmur hash to do the hash. So please use the guava library and do the murmur hashing. Please don't copy the code unnecessarily from the spark.

spark using guava hash but not all the same like guava's impl, as for the changes in future of spark, if we want to keep same hash code as spark, maybe we can depend on spark-unsafe jar directly base on spark-version just like carbon depend on diff spark version.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

Zhangshunyu edited a comment on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593037009

> @Zhangshunyu other way is to let the spark do the bucketing like how the partitioner is implemented. In fact, we can add the bucketing directly into the partition flow. Not much changes needed in that case.

@ravipesala is guava murmur hash the same as spark using?

> @Zhangshunyu It was a supported feature earlier but it is bad that code got removed some time back. Anyway, spark changed the hashing technique on creating buckets so we cannot rely on our own hashing anymore.
> I see a lot of code got copied spark to just get the hashing. it is not recommended to do so as in the future if they change it will again break. Even they follow industry-standard murmur hash to do the hash. So please use the guava library and do the murmur hashing. Please don't copy the code unnecessarily from the spark.

@ravipesala spark using guava hash but not all the same like guava's impl, as for the changes in future of spark, if we want to keep same hash code as spark, maybe we can depend on spark-unsafe jar directly base on spark-version just like carbon depend on diff spark version.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593038010

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/545/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593050200

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2245/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593054571

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/547/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

CarbonDataQA1 commented on issue #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#issuecomment-593061151

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2247/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386767939

##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
*/
public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";

+ /**
+ * Hash method of bucket table
+ */
+ public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+ public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+ public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+ public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+ /**
+ * bucket properties
+ */
+ public static final String BUCKET_COLUMNS = "bucketcolumns";

Review comment:
Is these for table properties? suggest to change to "bucket_columns"

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

GitBox

[GitHub] [carbondata] jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table

In reply to this post by GitBox

jackylk commented on a change in pull request #3637: [CARBONDATA-3721][CARBONDATA-3590] Optimize Bucket Table
URL: https://github.com/apache/carbondata/pull/3637#discussion_r386768101

##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -2379,4 +2379,18 @@ private CarbonCommonConstants() {
*/
public static final String CARBON_SI_SEGMENT_MERGE_DEFAULT = "false";

+ /**
+ * Hash method of bucket table
+ */
+ public static final String BUCKET_HASH_METHOD = "bucket_hash_method";
+ public static final String BUCKET_HASH_METHOD_DEFAULT = "spark_hash_expression";
+ public static final String BUCKET_HASH_METHOD_SPARK_EXPRESSION = "spark_hash_expression";
+ public static final String BUCKET_HASH_METHOD_NATIVE = "native";
+
+ /**
+ * bucket properties
+ */
+ public static final String BUCKET_COLUMNS = "bucketcolumns";

Review comment:
Can we follow the bucket table syntax from hive?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services

123