[GitHub] [carbondata] Zhangshunyu opened a new pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

classic Classic list List threaded Threaded
59 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu opened a new pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox

Zhangshunyu opened a new pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020


    ### Why is this PR needed?
    Currentlly, minor compaction only consider the num of segments and major
   compaction only consider the SUM size of segments, but consider a scenario
   that the user want to use minor compaction by the num of segments but he
   dont want to merge the segment whose datasize larger the threshold for
   example 2GB, as it is no need to merge so much big segment and it is time
   costly.
   
    ### What changes were proposed in this PR?
   add a parameter to control the threshold of segment included
   in minor compaction, so that the user can specify the segment not included
   in minor compaction once the datasize exeed the threshold, system level and table level can be set, and if not set the use default
   value.
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - Yes
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732662506


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3118/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732662834


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4871/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732721963


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4874/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732722971


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3121/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732761517


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4879/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732822942


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4881/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732837957


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3127/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu closed pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu closed pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733497266


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733544457


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4897/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733546163


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3143/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365161



##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)

Review comment:
       Just the table property is enough, why again a carbon property is needed ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365650



##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)
+  public static final String CARBON_MINOR_COMPACTION_SIZE = "carbon.minor.compaction.size";
+
   /**
    * By default size of major compaction in MBs.
    */
   public static final String DEFAULT_CARBON_MAJOR_COMPACTION_SIZE = "1024";
 
+  /**
+   * By default size of minor compaction in MBs.
+   */
+  public static final String DEFAULT_CARBON_MINOR_COMPACTION_SIZE = "1048576";

Review comment:
       Don't keep default value, I have seen many user's segment more than 1TB. so for them auto compaction will not work by default. so, I suggest if table property is configured, then consider size for minor compaction. else the base behavior like consider all segments based on numbers. Also can support alter table property




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531366602



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction") {

Review comment:
       can simplify the test case (no need of huge rows)
   Just insert 1 row 4 times and do minor compaction, it should work. and set table property to 1 MB and insert 1 row 4 times and do minor compaction, it shouldn't compact.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367024



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction") {

Review comment:
       Also test for both partition and non partition flow




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367494



##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -311,10 +311,8 @@ public static String getLoadNumberFromLoadName(String loadName) {
       listOfSegmentsToBeMerged = identifySegmentsToBeMergedBasedOnSize(compactionSize,
               listOfSegmentsLoadedInSameDateInterval, carbonLoadModel);
     } else {
-
-      listOfSegmentsToBeMerged =
-              identifySegmentsToBeMergedBasedOnSegCount(listOfSegmentsLoadedInSameDateInterval,
-                      tableLevelProperties);
+      listOfSegmentsToBeMerged = identifySegmentsToBeMergedBasedOnSegCount(compactionSize,

Review comment:
       may be you can calculate compaction size, inside this method as table property is already there. No need to modify this method signature.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531370782



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction") {

Review comment:
       @ajantha-bhat set to 1MB means the segment size > 1MB, it should be ignore in compaction flow. how can 4 lines data reach to 1MB? if use 4 lines data, both will compact.
   here we use huge data to exceed 1MB to test the hug segment is ignored




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction") {

Review comment:
       @Zhangshunyu : Agree , I got confused.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction") {

Review comment:
       @Zhangshunyu : Agree , I got confused for testcase, we need more than 1 MB data for testing it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


123