Zhangshunyu opened a new pull request #4020: URL: https://github.com/apache/carbondata/pull/4020 ### Why is this PR needed? Currently, minor compaction only considers the number of segments, and major compaction only considers the total size of segments. But consider a scenario where a user wants to run minor compaction based on segment count, yet does not want to merge any segment whose data size exceeds a threshold (for example 2 GB), since merging such a large segment is unnecessary and time-consuming. ### What changes were proposed in this PR? Add a parameter that controls the size threshold for segments included in minor compaction, so that the user can exclude a segment from minor compaction once its data size exceeds the threshold. The parameter can be set at both the system level and the table level; if it is not set, the default value is used. ### Does this PR introduce any user interface change? - No ### Is any new testcase added? - Yes ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email]
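As a sketch of how the proposed setting might be used: the system-level property name `carbon.minor.compaction.size` is taken from this PR's diff, but the table-level property name below is an assumption for illustration only, not confirmed by the PR.

```sql
-- System level (e.g. carbon.properties, or dynamically since the diff marks
-- the property dynamicConfigurable); value is in MB per the PR's diff.
SET carbon.minor.compaction.size = 2048;

-- Table level (property name assumed here for illustration):
ALTER TABLE sales SET TBLPROPERTIES ('minor_compaction_size' = '2048');
```

With either setting in place, a segment larger than 2048 MB would be skipped by minor compaction rather than merged.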
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732662506 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3118/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732662834 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4871/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732721963 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4874/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732722971 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3121/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732761517 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4879/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732822942 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4881/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-732837957 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3127/
Zhangshunyu closed pull request #4020: URL: https://github.com/apache/carbondata/pull/4020
Zhangshunyu commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733497266 retest this please
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733544457 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4897/
CarbonDataQA2 commented on pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#issuecomment-733546163 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3143/
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365161 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -736,11 +738,22 @@ private CarbonCommonConstants() { @CarbonProperty(dynamicConfigurable = true) public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size"; + /** + * Size of Minor Compaction in MBs + */ + @CarbonProperty(dynamicConfigurable = true) Review comment: The table property alone is enough; why is a carbon property needed as well?
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531365650 ########## File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java ########## @@ -736,11 +738,22 @@ private CarbonCommonConstants() { @CarbonProperty(dynamicConfigurable = true) public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size"; + /** + * Size of Minor Compaction in MBs + */ + @CarbonProperty(dynamicConfigurable = true) + public static final String CARBON_MINOR_COMPACTION_SIZE = "carbon.minor.compaction.size"; + /** * By default size of major compaction in MBs. */ public static final String DEFAULT_CARBON_MAJOR_COMPACTION_SIZE = "1024"; + /** + * By default size of minor compaction in MBs. + */ + public static final String DEFAULT_CARBON_MINOR_COMPACTION_SIZE = "1048576"; Review comment: Don't keep a default value; I have seen many users with segments larger than 1 TB, so auto compaction would not work for them by default. I suggest: if the table property is configured, then consider segment size for minor compaction; otherwise keep the base behavior of considering all segments based on count. Setting the property via ALTER TABLE could also be supported.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531366602 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: The test case can be simplified (no need for huge row counts): just insert 1 row 4 times and run minor compaction; it should work. Then set the table property to 1 MB, insert 1 row 4 times, and run minor compaction; it shouldn't compact.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367024 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: Also test both the partition and non-partition flows.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531367494 ########## File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java ########## @@ -311,10 +311,8 @@ public static String getLoadNumberFromLoadName(String loadName) { listOfSegmentsToBeMerged = identifySegmentsToBeMergedBasedOnSize(compactionSize, listOfSegmentsLoadedInSameDateInterval, carbonLoadModel); } else { - - listOfSegmentsToBeMerged = - identifySegmentsToBeMergedBasedOnSegCount(listOfSegmentsLoadedInSameDateInterval, - tableLevelProperties); + listOfSegmentsToBeMerged = identifySegmentsToBeMergedBasedOnSegCount(compactionSize, Review comment: Maybe you can calculate the compaction size inside this method, since the table property is already available there. No need to modify this method signature.
Zhangshunyu commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531370782 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: @ajantha-bhat Setting the property to 1 MB means a segment whose size exceeds 1 MB should be ignored in the compaction flow. How could 4 rows of data reach 1 MB? With 4 rows, both segments would compact. The huge data set is used here so the segment exceeds 1 MB, to test that the huge segment is ignored.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: @Zhangshunyu: Agree, I got confused.
ajantha-bhat commented on a change in pull request #4020: URL: https://github.com/apache/carbondata/pull/4020#discussion_r531371146 ########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala ########## @@ -186,6 +187,78 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll } + def generateData(numOrders: Int = 100000): DataFrame = { + import sqlContext.implicits._ + sqlContext.sparkContext.parallelize(1 to numOrders, 4) + .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x, + "serialname" + x, x + 10000) + }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary") + } + + test("test skip segment whose data size exceed threshold in minor compaction") { Review comment: @Zhangshunyu: Agree, I got confused. For the test case, we do need more than 1 MB of data to test it.