Zhangshunyu opened a new pull request #4020: URL: https://github.com/apache/carbondata/pull/4020 ### Why is this PR needed? Currently, minor compaction only considers the number of segments, and major compaction only considers the total size of segments. But consider a scenario where a user wants to run minor compaction based on segment count, yet does not want to merge any segment whose data size exceeds a threshold (for example 2 GB), since merging such a large segment is unnecessary and time-consuming. ### What changes were proposed in this PR? Add a parameter that controls the size threshold for segments included in minor compaction, so that the user can exclude a segment from minor compaction once its data size exceeds the threshold. The parameter can be set at both the system level and the table level; if it is not set, the default value is used. ### Does this PR introduce any user interface change? - No ### Is any new testcase added? - Yes ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email]
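As a sketch of how the proposed setting might be used: the system-level property name `carbon.minor.compaction.size` is taken from this PR's diff, but the table-level property name below is an assumption for illustration only, not confirmed by the PR.

```sql
-- System level (e.g. carbon.properties, or dynamically since the diff marks
-- the property dynamicConfigurable); value is in MB per the PR's diff.
SET carbon.minor.compaction.size = 2048;

-- Table level (property name assumed here for illustration):
ALTER TABLE sales SET TBLPROPERTIES ('minor_compaction_size' = '2048');
```

With either setting in place, a segment larger than 2048 MB would be skipped by minor compaction rather than merged.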
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732662506 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3118/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732662834 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4871/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732721963 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4874/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732722971 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3121/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732761517 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4879/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732822942 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4881/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732837957 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3127/
Zhangshunyu closed pull request #4020: URL: https://github.com/apache/carbondata/pull/4020
Zhangshunyu commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733497266 retest this please
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733544457 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4897/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733546163 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3143/
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365161 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -736,11 +738,22 @@ private CarbonCommonConstants() { @CarbonProperty(dynamicConfigurable = true) public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size"; + /** + * Size of Minor Compaction in MBs + */ + @CarbonProperty(dynamicConfigurable = true) Review comment: The table property alone is enough; why is a carbon property needed as well?
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365650 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -736,11 +738,22 @@ private CarbonCommonConstants() { @CarbonProperty(dynamicConfigurable = true) public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size"; + /** + * Size of Minor Compaction in MBs + */ + @CarbonProperty(dynamicConfigurable = true) + public static final String CARBON_MINOR_COMPACTION_SIZE = "carbon.minor.compaction.size"; + /** * By default size of major compaction in MBs. */ public static final String DEFAULT_CARBON_MAJOR_COMPACTION_SIZE = "1024"; + /** + * By default size of minor compaction in MBs. + */ + public static final String DEFAULT_CARBON_MINOR_COMPACTION_SIZE = "1048576"; Review comment: Don't keep a default value; I have seen many users with segments larger than 1 TB, so auto compaction would not work for them by default. I suggest: if the table property is configured, then consider segment size for minor compaction; otherwise keep the base behavior of considering all segments based on count. Setting the property via ALTER TABLE could also be supported.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531366602 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: The test case can be simplified (no need for huge row counts): just insert 1 row 4 times and run minor compaction; it should work. Then set the table property to 1 MB, insert 1 row 4 times, and run minor compaction; it shouldn't compact.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367024 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: Also test both the partition and non-partition flows.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367494 ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -311,10 +311,8 @@ public static String getLoadNumberFromLoadName(String loadName) { listOfSegmentsToBeMerged = identifySegmentsToBeMergedBasedOnSize(compactionSize, listOfSegmentsLoadedInSameDateInterval, carbonLoadModel); } else { - - listOfSegmentsToBeMerged = - identifySegmentsToBeMergedBasedOnSegCount(listOfSegmentsLoadedInSameDateInterval, - tableLevelProperties); + listOfSegmentsToBeMerged = identifySegmentsToBeMergedBasedOnSegCount(compactionSize, Review comment: Maybe you can calculate the compaction size inside this method, since the table property is already available there. No need to modify this method signature.
Zhangshunyu commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531370782 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: @ajantha-bhat Setting the property to 1 MB means a segment whose size exceeds 1 MB should be ignored in the compaction flow. How could 4 rows of data reach 1 MB? With 4 rows, both segments would compact. The huge data set is used here so the segment exceeds 1 MB, to test that the huge segment is ignored.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: @Zhangshunyu: Agree, I got confused.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: @Zhangshunyu: Agree, I got confused. For the test case, we do need more than 1 MB of data to test it.