[GitHub] [carbondata] Zhangshunyu opened a new pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

classic Classic list List threaded Threaded
59 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531372454



##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)

Review comment:
       @ajantha-bhat some users need system value to control all tables if use same value, like major compaction system value. I think by default we dont use it, and the user can specify for each table, if he want to set for all tables, he can use the system level parameter.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r531372539



##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -736,11 +738,22 @@ private CarbonCommonConstants() {
   @CarbonProperty(dynamicConfigurable = true)
   public static final String CARBON_MAJOR_COMPACTION_SIZE = "carbon.major.compaction.size";
 
+  /**
+   * Size of Minor Compaction in MBs
+   */
+  @CarbonProperty(dynamicConfigurable = true)
+  public static final String CARBON_MINOR_COMPACTION_SIZE = "carbon.minor.compaction.size";
+
   /**
    * By default size of major compaction in MBs.
    */
   public static final String DEFAULT_CARBON_MAJOR_COMPACTION_SIZE = "1024";
 
+  /**
+   * By default size of minor compaction in MBs.
+   */
+  public static final String DEFAULT_CARBON_MINOR_COMPACTION_SIZE = "1048576";

Review comment:
       @ajantha-bhat Yes, 1TB is not proper here. Will remove it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-734733376


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4949/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-734738751


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3194/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532361537



##########
File path: docs/dml-of-carbondata.md
##########
@@ -529,6 +529,10 @@ CarbonData DML statements are documented here,which includes:
   * Level 1: Merging of the segments which are not yet compacted.
   * Level 2: Merging of the compacted segments again to form a larger segment.
 
+  The segment whose data size exceed limit of carbon.minor.compaction.size will not be included in
+  minor compaction. If user want to control the size of segment included in minor compaction,
+  configure the property with appropriate value in MB, if not configure, will merge segments only
+  based on num of segments.

Review comment:
       ```suggestion
     based on number of segments.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532362299



##########
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##########
@@ -998,6 +998,23 @@ public long getMajorCompactionSize() {
     return compactionSize;
   }
 
+  /**
+   * returns minor compaction size value from carbon properties or 0 if it is not valid or
+   * not configured
+   *
+   * @return compactionSize
+   */
+  public long getMinorCompactionSize() {
+    long compactionSize = 0;
+    try {
+      compactionSize = Long.parseLong(getProperty(

Review comment:
       please handle if user sets negative value.
   user can set from 0 to LONG_MAX I think.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532362493



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties
+   */
+  def validateMinorCompactionSize(tableProperties: Map[String, String]): Unit = {
+    var minorCompactionSize: Integer = 0
+    val tblPropName = CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE
+    if (tableProperties.get(tblPropName).isDefined) {
+      val minorCompactionSizeStr: String =
+        parsePropertyValueStringInMB(tableProperties(tblPropName))
+      try {
+        minorCompactionSize = Integer.parseInt(minorCompactionSizeStr)
+      } catch {
+        case e: NumberFormatException =>
+          throw new MalformedCarbonCommandException(s"Invalid $tblPropName value found: " +
+            s"$minorCompactionSizeStr, only int value greater than 0 is supported.")

Review comment:
       0 is also supported right ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532362600



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties
+   */
+  def validateMinorCompactionSize(tableProperties: Map[String, String]): Unit = {
+    var minorCompactionSize: Integer = 0
+    val tblPropName = CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE
+    if (tableProperties.get(tblPropName).isDefined) {
+      val minorCompactionSizeStr: String =
+        parsePropertyValueStringInMB(tableProperties(tblPropName))
+      try {
+        minorCompactionSize = Integer.parseInt(minorCompactionSizeStr)
+      } catch {
+        case e: NumberFormatException =>
+          throw new MalformedCarbonCommandException(s"Invalid $tblPropName value found: " +
+            s"$minorCompactionSizeStr, only int value greater than 0 is supported.")
+      }
+      if (minorCompactionSize < 0) {

Review comment:
       if zero is not supported, then <= 0




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532363057



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/table/CarbonDescribeFormattedCommand.scala
##########
@@ -191,6 +191,10 @@ private[sql] case class CarbonDescribeFormattedCommand(
           CarbonProperties.getInstance()
             .getProperty(CarbonCommonConstants.CARBON_MAJOR_COMPACTION_SIZE,
               CarbonCommonConstants.DEFAULT_CARBON_MAJOR_COMPACTION_SIZE)), ""),
+      (CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE.toUpperCase,
+        tblProps.getOrElse(CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE,
+          CarbonProperties.getInstance()
+            .getProperty(CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE, "0")), ""),

Review comment:
       ok, lets keep min configurable to 1 MB and if the user not configured, we can keep -1. because 0 means, as per property skip all the segments that are greater than 0 MB.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532364698



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,206 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x % 10,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction " +
+    "in system level control") {
+    CarbonProperties.getInstance().addProperty("carbon.compaction.level.threshold", "2,2")
+    CarbonProperties.getInstance()
+      .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "mm/dd/yyyy")
+    // set threshold to 1MB in this test case
+    CarbonProperties.getInstance().addProperty("carbon.minor.compaction.size", "1")
+
+    sql("drop table if exists  minor_threshold")
+    sql("drop table if exists  tmp")
+
+    sql(
+      "CREATE TABLE IF NOT EXISTS minor_threshold (country String, ID Int, date Timestamp," +
+        " name String, phonetype String, serialname String, salary Int) STORED AS carbondata"
+    )
+    sql(
+      "CREATE TABLE IF NOT EXISTS tmp (country String, ID Int, date Timestamp," +
+        " name String, phonetype String, serialname String, salary Int) STORED AS carbondata"
+    )
+
+    val initframe = generateData(100000)
+    initframe.write
+      .format("carbondata")
+      .option("tablename", "tmp")
+      .mode(SaveMode.Overwrite)
+      .save()
+    // load 3 segments
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath2 + "' INTO TABLE minor_threshold  OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+
+    // insert a new segment(id is 3) data size exceed 1 MB
+    sql("insert into minor_threshold select * from tmp")
+
+    // load another 3 segments
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath2 + "' INTO TABLE minor_threshold  OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+
+    sql("show segments for table minor_threshold").show(100, false)
+    // do minor compaction
+    sql("alter table minor_threshold compact 'minor'"
+    )
+    // check segment 3 whose size exceed the limit should not be compacted
+    val carbonTable = CarbonMetadata.getInstance().getCarbonTable(
+      CarbonCommonConstants.DATABASE_DEFAULT_NAME, "minor_threshold")
+    val carbonTablePath = carbonTable.getMetadataPath
+    val segments = SegmentStatusManager.readLoadMetadata(carbonTablePath);
+    assertResult(SegmentStatus.SUCCESS)(segments(3).getSegmentStatus)
+    assertResult(100030)(sql("select count(*) from minor_threshold").collect().head.get(0))
+    // reset to 0
+    CarbonProperties.getInstance().addProperty("carbon.minor.compaction.size", "0")
+  }

Review comment:
       support dynamically changing the table property also, by alter table set/unset tblpeoperties command. With that in the same testcase you can test table level by loading some more data. No need to create new tables again to test it

##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,206 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x % 10,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction " +
+    "in system level control") {
+    CarbonProperties.getInstance().addProperty("carbon.compaction.level.threshold", "2,2")
+    CarbonProperties.getInstance()
+      .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "mm/dd/yyyy")
+    // set threshold to 1MB in this test case
+    CarbonProperties.getInstance().addProperty("carbon.minor.compaction.size", "1")
+
+    sql("drop table if exists  minor_threshold")
+    sql("drop table if exists  tmp")
+
+    sql(
+      "CREATE TABLE IF NOT EXISTS minor_threshold (country String, ID Int, date Timestamp," +
+        " name String, phonetype String, serialname String, salary Int) STORED AS carbondata"
+    )
+    sql(
+      "CREATE TABLE IF NOT EXISTS tmp (country String, ID Int, date Timestamp," +
+        " name String, phonetype String, serialname String, salary Int) STORED AS carbondata"
+    )
+
+    val initframe = generateData(100000)
+    initframe.write
+      .format("carbondata")
+      .option("tablename", "tmp")
+      .mode(SaveMode.Overwrite)
+      .save()
+    // load 3 segments
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath2 + "' INTO TABLE minor_threshold  OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+
+    // insert a new segment(id is 3) data size exceed 1 MB
+    sql("insert into minor_threshold select * from tmp")
+
+    // load another 3 segments
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath2 + "' INTO TABLE minor_threshold  OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+
+    sql("show segments for table minor_threshold").show(100, false)
+    // do minor compaction
+    sql("alter table minor_threshold compact 'minor'"
+    )
+    // check segment 3 whose size exceed the limit should not be compacted
+    val carbonTable = CarbonMetadata.getInstance().getCarbonTable(
+      CarbonCommonConstants.DATABASE_DEFAULT_NAME, "minor_threshold")
+    val carbonTablePath = carbonTable.getMetadataPath
+    val segments = SegmentStatusManager.readLoadMetadata(carbonTablePath);
+    assertResult(SegmentStatus.SUCCESS)(segments(3).getSegmentStatus)
+    assertResult(100030)(sql("select count(*) from minor_threshold").collect().head.get(0))
+    // reset to 0
+    CarbonProperties.getInstance().addProperty("carbon.minor.compaction.size", "0")
+  }

Review comment:
       support dynamically changing the table property also, by alter table set/unset tblproperties command. With that in the same testcase you can test table level by loading some more data. No need to create new tables again to test it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532405435



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/MajorCompactionIgnoreInMinorTest.scala
##########
@@ -186,6 +187,206 @@ class MajorCompactionIgnoreInMinorTest extends QueryTest with BeforeAndAfterAll
 
   }
 
+  def generateData(numOrders: Int = 100000): DataFrame = {
+    import sqlContext.implicits._
+    sqlContext.sparkContext.parallelize(1 to numOrders, 4)
+      .map { x => ("country" + x, x, "07/23/2015", "name" + x, "phonetype" + x % 10,
+        "serialname" + x, x + 10000)
+      }.toDF("country", "ID", "date", "name", "phonetype", "serialname", "salary")
+  }
+
+  test("test skip segment whose data size exceed threshold in minor compaction " +
+    "in system level control") {
+    CarbonProperties.getInstance().addProperty("carbon.compaction.level.threshold", "2,2")
+    CarbonProperties.getInstance()
+      .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "mm/dd/yyyy")
+    // set threshold to 1MB in this test case
+    CarbonProperties.getInstance().addProperty("carbon.minor.compaction.size", "1")
+
+    sql("drop table if exists  minor_threshold")
+    sql("drop table if exists  tmp")
+
+    sql(
+      "CREATE TABLE IF NOT EXISTS minor_threshold (country String, ID Int, date Timestamp," +
+        " name String, phonetype String, serialname String, salary Int) STORED AS carbondata"
+    )
+    sql(
+      "CREATE TABLE IF NOT EXISTS tmp (country String, ID Int, date Timestamp," +
+        " name String, phonetype String, serialname String, salary Int) STORED AS carbondata"
+    )
+
+    val initframe = generateData(100000)
+    initframe.write
+      .format("carbondata")
+      .option("tablename", "tmp")
+      .mode(SaveMode.Overwrite)
+      .save()
+    // load 3 segments
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath2 + "' INTO TABLE minor_threshold  OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+
+    // insert a new segment(id is 3) data size exceed 1 MB
+    sql("insert into minor_threshold select * from tmp")
+
+    // load another 3 segments
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath2 + "' INTO TABLE minor_threshold  OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+    sql("LOAD DATA LOCAL INPATH '" + csvFilePath1 + "' INTO TABLE minor_threshold OPTIONS" +
+      "('DELIMITER'= ',', 'QUOTECHAR'= '\"')"
+    )
+
+    sql("show segments for table minor_threshold").show(100, false)
+    // do minor compaction
+    sql("alter table minor_threshold compact 'minor'"
+    )
+    // check segment 3 whose size exceed the limit should not be compacted
+    val carbonTable = CarbonMetadata.getInstance().getCarbonTable(
+      CarbonCommonConstants.DATABASE_DEFAULT_NAME, "minor_threshold")
+    val carbonTablePath = carbonTable.getMetadataPath
+    val segments = SegmentStatusManager.readLoadMetadata(carbonTablePath);
+    assertResult(SegmentStatus.SUCCESS)(segments(3).getSegmentStatus)
+    assertResult(100030)(sql("select count(*) from minor_threshold").collect().head.get(0))
+    // reset to 0
+    CarbonProperties.getInstance().addProperty("carbon.minor.compaction.size", "0")
+  }

Review comment:
       @ajantha-bhat OK




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532405537



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/table/CarbonDescribeFormattedCommand.scala
##########
@@ -191,6 +191,10 @@ private[sql] case class CarbonDescribeFormattedCommand(
           CarbonProperties.getInstance()
             .getProperty(CarbonCommonConstants.CARBON_MAJOR_COMPACTION_SIZE,
               CarbonCommonConstants.DEFAULT_CARBON_MAJOR_COMPACTION_SIZE)), ""),
+      (CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE.toUpperCase,
+        tblProps.getOrElse(CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE,
+          CarbonProperties.getInstance()
+            .getProperty(CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE, "0")), ""),

Review comment:
       @ajantha-bhat yes




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532405630



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties
+   */
+  def validateMinorCompactionSize(tableProperties: Map[String, String]): Unit = {
+    var minorCompactionSize: Integer = 0
+    val tblPropName = CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE
+    if (tableProperties.get(tblPropName).isDefined) {
+      val minorCompactionSizeStr: String =
+        parsePropertyValueStringInMB(tableProperties(tblPropName))
+      try {
+        minorCompactionSize = Integer.parseInt(minorCompactionSizeStr)
+      } catch {
+        case e: NumberFormatException =>
+          throw new MalformedCarbonCommandException(s"Invalid $tblPropName value found: " +
+            s"$minorCompactionSizeStr, only int value greater than 0 is supported.")
+      }
+      if (minorCompactionSize < 0) {

Review comment:
       @ajantha-bhat ok




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532405803



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties
+   */
+  def validateMinorCompactionSize(tableProperties: Map[String, String]): Unit = {
+    var minorCompactionSize: Integer = 0
+    val tblPropName = CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE
+    if (tableProperties.get(tblPropName).isDefined) {
+      val minorCompactionSizeStr: String =
+        parsePropertyValueStringInMB(tableProperties(tblPropName))
+      try {
+        minorCompactionSize = Integer.parseInt(minorCompactionSizeStr)
+      } catch {
+        case e: NumberFormatException =>
+          throw new MalformedCarbonCommandException(s"Invalid $tblPropName value found: " +
+            s"$minorCompactionSizeStr, only int value greater than 0 is supported.")

Review comment:
       @ajantha-bhat 0 is not supported, will add check




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532406054



##########
File path: docs/dml-of-carbondata.md
##########
@@ -529,6 +529,10 @@ CarbonData DML statements are documented here,which includes:
   * Level 1: Merging of the segments which are not yet compacted.
   * Level 2: Merging of the compacted segments again to form a larger segment.
 
+  The segment whose data size exceed limit of carbon.minor.compaction.size will not be included in
+  minor compaction. If user want to control the size of segment included in minor compaction,
+  configure the property with appropriate value in MB, if not configure, will merge segments only
+  based on num of segments.

Review comment:
       @ajantha-bhat changed

##########
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##########
@@ -998,6 +998,23 @@ public long getMajorCompactionSize() {
     return compactionSize;
   }
 
+  /**
+   * returns minor compaction size value from carbon properties or 0 if it is not valid or
+   * not configured
+   *
+   * @return compactionSize
+   */
+  public long getMinorCompactionSize() {
+    long compactionSize = 0;
+    try {
+      compactionSize = Long.parseLong(getProperty(

Review comment:
       @ajantha-bhat handled




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-735676431


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4971/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-735679565


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3216/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

akashrn5 commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r532795309



##########
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##########
@@ -998,6 +998,33 @@ public long getMajorCompactionSize() {
     return compactionSize;
   }
 
+  /**
+   * returns minor compaction size value from carbon properties or 0 if it is not valid or

Review comment:
       shouldn't it be -1?, please change it

##########
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##########
@@ -998,6 +998,33 @@ public long getMajorCompactionSize() {
     return compactionSize;
   }
 
+  /**
+   * returns minor compaction size value from carbon properties or 0 if it is not valid or
+   * not configured
+   *
+   * @return compactionSize
+   */
+  public long getMinorCompactionSize() {
+    long compactionSize = -1;
+    try {
+      compactionSize = Long.parseLong(getProperty(
+              CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE, "0"));
+    } catch (NumberFormatException e) {
+      LOGGER.warn("The specified value for property "
+              + CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE + " is incorrect."
+              + " Correct value should be long value great than 1. Taking the default value -1" +
+              " which means not consider segment size in minor compaction.");
+    }
+    if (compactionSize <= 0 && compactionSize != -1) {
+      LOGGER.warn("The specified value for property "
+              + CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE + " is incorrect."

Review comment:
       same as above

##########
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##########
@@ -998,6 +998,33 @@ public long getMajorCompactionSize() {
     return compactionSize;
   }
 
+  /**
+   * returns minor compaction size value from carbon properties or 0 if it is not valid or
+   * not configured
+   *
+   * @return compactionSize
+   */
+  public long getMinorCompactionSize() {
+    long compactionSize = -1;
+    try {
+      compactionSize = Long.parseLong(getProperty(
+              CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE, "0"));
+    } catch (NumberFormatException e) {
+      LOGGER.warn("The specified value for property "
+              + CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE + " is incorrect."
+              + " Correct value should be long value great than 1. Taking the default value -1" +
+              " which means not consider segment size in minor compaction.");

Review comment:
       Please change as below:
   
   `Invalid value is configured for property CarbonCommonConstants.CARBON_MINOR_COMPACTION_SIZE, considering the default value -1 and not considering segment Size during minor compaction.`

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties

Review comment:
       remove this, not needed as variable name clearly tells what it is

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties
+   */
+  def validateMinorCompactionSize(tableProperties: Map[String, String]): Unit = {
+    var minorCompactionSize: Integer = 0
+    val tblPropName = CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE

Review comment:
       please give a better variable name

##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -767,18 +784,27 @@ private static boolean isMergedSegment(String segName) {
   public static long getCompactionSize(CompactionType compactionType,
                                        CarbonLoadModel carbonLoadModel) {
     long compactionSize = 0;
+    Map<String, String> tblProps = carbonLoadModel.getCarbonDataLoadSchema()
+            .getCarbonTable().getTableInfo().getFactTable().getTableProperties();
     switch (compactionType) {
       case MAJOR:
         // default value is system level option
         compactionSize = CarbonProperties.getInstance().getMajorCompactionSize();
         // if table level option is identified, use it to overwrite system level option
-        Map<String, String> tblProps = carbonLoadModel.getCarbonDataLoadSchema()
-                .getCarbonTable().getTableInfo().getFactTable().getTableProperties();
         if (tblProps.containsKey(CarbonCommonConstants.TABLE_MAJOR_COMPACTION_SIZE)) {
           compactionSize = Long.parseLong(
                   tblProps.get(CarbonCommonConstants.TABLE_MAJOR_COMPACTION_SIZE));
         }
         break;
+      case MINOR:
+        // default value is system level option
+        compactionSize = CarbonProperties.getInstance().getMinorCompactionSize();

Review comment:
       you can move this to else block

##########
File path: docs/dml-of-carbondata.md
##########
@@ -529,6 +529,10 @@ CarbonData DML statements are documented here,which includes:
   * Level 1: Merging of the segments which are not yet compacted.
   * Level 2: Merging of the compacted segments again to form a larger segment.
 
+  The segment whose data size exceed limit of carbon.minor.compaction.size will not be included in
+  minor compaction. If user want to control the size of segment included in minor compaction,
+  configure the property with appropriate value in MB, if not configure, will merge segments only
+  based on number of segments.

Review comment:
       change to below
   
   `User can control the size of a segment to be included in the minor compaction by using carbon.minor.compaction.size. If not configured, minor compaction will consider the segments based on carbon.compaction.level.threshold by neglecting the size of each segment`

##########
File path: docs/dml-of-carbondata.md
##########
@@ -529,6 +529,10 @@ CarbonData DML statements are documented here,which includes:
   * Level 1: Merging of the segments which are not yet compacted.
   * Level 2: Merging of the compacted segments again to form a larger segment.
 
+  The segment whose data size exceed limit of carbon.minor.compaction.size will not be included in

Review comment:
       ```suggestion
     The segment whose data size exceed the limit of carbon.minor.compaction.size will not be included in
   ```

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties
+   */
+  def validateMinorCompactionSize(tableProperties: Map[String, String]): Unit = {
+    var minorCompactionSize: Integer = 0
+    val tblPropName = CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE
+    if (tableProperties.get(tblPropName).isDefined) {
+      val minorCompactionSizeStr: String =
+        parsePropertyValueStringInMB(tableProperties(tblPropName))
+      try {
+        minorCompactionSize = Integer.parseInt(minorCompactionSizeStr)

Review comment:
       no need of 306 and 307, directly add that in 309 inside parseInt

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala
##########
@@ -292,6 +293,33 @@ object CommonUtil {
     }
   }
 
+  /**
+   * This method will validate the minor compaction size specified by the user
+   * the property is used while doing minor compaction
+   *
+   * @param tableProperties
+   */
+  def validateMinorCompactionSize(tableProperties: Map[String, String]): Unit = {
+    var minorCompactionSize: Integer = 0
+    val tblPropName = CarbonCommonConstants.TABLE_MINOR_COMPACTION_SIZE
+    if (tableProperties.get(tblPropName).isDefined) {
+      val minorCompactionSizeStr: String =
+        parsePropertyValueStringInMB(tableProperties(tblPropName))
+      try {
+        minorCompactionSize = Integer.parseInt(minorCompactionSizeStr)
+      } catch {
+        case e: NumberFormatException =>
+          throw new MalformedCarbonCommandException(s"Invalid $tblPropName value found: " +
+            s"$minorCompactionSizeStr, only int value greater than 0 is supported.")
+      }
+      if (minorCompactionSize <= 0) {
+        throw new MalformedCarbonCommandException(s"Invalid $tblPropName value found: " +
+          s"$minorCompactionSizeStr, only int value greater than 0 is supported.")

Review comment:
       Please change like below
   
   `Invalid value $minorCompactionSizeStr configured for $tblPropName. Please consider configuring value greater than 0`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on a change in pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on a change in pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#discussion_r533018015



##########
File path: core/src/main/java/org/apache/carbondata/core/util/CarbonProperties.java
##########
@@ -998,6 +998,33 @@ public long getMajorCompactionSize() {
     return compactionSize;
   }
 
+  /**
+   * returns minor compaction size value from carbon properties or 0 if it is not valid or

Review comment:
       @akashrn5 yes, should be -1, changed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Zhangshunyu commented on pull request #4020: [CARBONDATA-4054] Support data size control for minor compaction

GitBox
In reply to this post by GitBox

Zhangshunyu commented on pull request #4020:
URL: https://github.com/apache/carbondata/pull/4020#issuecomment-736162189


   @akashrn5 @ajantha-bhat  fixed all review comments, pls check.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


123