[GitHub] [carbondata] Indhumathi27 opened a new pull request #4130: [WIP] Local sort Partition Load and Compaction fix

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 opened a new pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox

Indhumathi27 opened a new pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130


    ### Why is this PR needed?
   
   
    ### What changes were proposed in this PR?
   
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-832643917


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/3560/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-832644656


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5305/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-832757354


   Build Failed  with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/3561/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-832758837


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5306/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-832860032


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5307/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-832861472


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/3562/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-838892337


   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5338/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [WIP] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-838893787


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/3593/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

Indhumathi27 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-839505329


   retest this please


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-839568283


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/3597/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-839574642


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5342/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#discussion_r634921359



##########
File path: core/src/main/java/org/apache/carbondata/core/constants/CarbonCommonConstants.java
##########
@@ -974,6 +974,16 @@ private CarbonCommonConstants() {
 
   public static final String LOAD_GLOBAL_SORT_PARTITIONS_DEFAULT = "0";
 
+  /**
+   * When enabled, tasks launched for Local sort partition load will be based on one node one task.
+   * Compaction will be performed based on task level for a partition.

Review comment:
       please also add why it is required and how it impacts memory and performance in the description as well as in the document.

##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/standardpartition/StandardPartitionTableCompactionTestCase.scala
##########
@@ -94,6 +94,45 @@ class StandardPartitionTableCompactionTestCase extends QueryTest with BeforeAndA
       sql("select empno, empname, designation, doj, workgroupcategory, workgroupcategoryname, deptno, deptname, projectcode, projectjoindate, projectenddate, attendance, utilization, salary from originTable where workgroupcategory=1 and empname='arvind' and designation='SE' order by empno"))
   }
 
+  test("data load and compaction for local sort partition table based on task id ") {
+    sql("drop table if exists partitionthree")
+    CarbonProperties.getInstance()
+      .addProperty(CarbonCommonConstants.CARBON_PARTITION_DATA_BASED_ON_TASK_LEVEL, "TRUE")
+    try {
+      sql("""
+            | CREATE TABLE partitionthree (empno int, doj Timestamp,
+            |  workgroupcategoryname String, deptno int, deptname String,
+            |  projectcode int, projectjoindate Timestamp, projectenddate Timestamp,attendance int,
+            |  utilization int,salary int)
+            | PARTITIONED BY (workgroupcategory int, empname String, designation String)
+            | STORED AS carbondata
+            | tblproperties('sort_scope'='local_sort', 'sort_columns'='deptname,empname')
+      """.stripMargin)
+      sql(s"""LOAD DATA local inpath '$resourcesPath/data.csv' INTO TABLE partitionthree OPTIONS('DELIMITER'= ',', 'QUOTECHAR'= '"')""".stripMargin)

Review comment:
       can add some validation points like a number of files count should be node count.
   I don't see any validation for the new feature added in the testcase.

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
##########
@@ -712,10 +731,63 @@ class CarbonMergerRDD[K, V](
     }
   }
 
+  /**
+   * Group the data files based on task Id per partition
+   */
+  private def getTaskNumberBasedOnID(
+      split: CarbonInputSplit,
+      partitionTaskMap: util.Map[PartitionSpec, util.HashMap[String, Int]],
+      counter: AtomicInteger): String = {
+    if (carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.isHivePartitionTable) {
+      val path = split.getPath.getParent
+      val partTask =
+        carbonMergerMapping.currentPartitions.get.find(p => p.getLocation.equals(path)) match {
+          case Some(part) => part
+          case None =>
+            throw new UnsupportedOperationException("Cannot do compaction on dropped partition")
+        }
+      var task: String = null
+      // split task Id is a combination of segment id(3 digits), task id (6 digits) and
+      // partition number (6 digits). Get the task Id from split task id
+      val partitionTaskId = split.taskId.substring(3, 9).toInt - Math.pow(10, 5).toInt

Review comment:
       If someone changes task Id logic, this may go for a toss and no error will be reported.
   May be write a Util method in `getTaskIdfromUniqueNumber` below `generateUniqueNumber` for better maintainability.

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala
##########
@@ -712,10 +731,63 @@ class CarbonMergerRDD[K, V](
     }
   }
 
+  /**
+   * Group the data files based on task Id per partition
+   */
+  private def getTaskNumberBasedOnID(
+      split: CarbonInputSplit,
+      partitionTaskMap: util.Map[PartitionSpec, util.HashMap[String, Int]],
+      counter: AtomicInteger): String = {
+    if (carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable.isHivePartitionTable) {
+      val path = split.getPath.getParent
+      val partTask =
+        carbonMergerMapping.currentPartitions.get.find(p => p.getLocation.equals(path)) match {
+          case Some(part) => part
+          case None =>
+            throw new UnsupportedOperationException("Cannot do compaction on dropped partition")
+        }
+      var task: String = null
+      // split task Id is a combination of segment id(3 digits), task id (6 digits) and
+      // partition number (6 digits). Get the task Id from split task id
+      val partitionTaskId = split.taskId.substring(3, 9).toInt - Math.pow(10, 5).toInt
+      if (partitionTaskMap.isEmpty || partitionTaskMap.get(partTask) == null) {
+        if (partitionTaskMap.isEmpty) {
+          task = counter.toString
+        } else {
+          task = counter.incrementAndGet().toString
+        }
+        val partitionTaskList = new util.HashMap[String, Int]()
+        partitionTaskList.put(task, partitionTaskId)
+        partitionTaskMap.put(partTask, partitionTaskList)
+      } else {
+        val taskMap = partitionTaskMap.get(partTask)
+        if (taskMap.containsValue(partitionTaskId)) {
+          task = taskMap.asScala.find(_._2.equals(partitionTaskId)).get._1
+        } else {
+          task = counter.incrementAndGet().toString
+          taskMap.put(task, partitionTaskId)
+        }
+        partitionTaskMap.put(partTask, taskMap)
+      }
+      task
+    } else {
+      split.taskId
+    }
+  }
+
   private def getPartitionNamesFromTask(taskId: String,
-      partitionTaskMap: util.Map[PartitionSpec, String]): Option[PartitionSpec] = {
+      partitionTaskMap: util.Map[PartitionSpec, String],
+      partitionTaskMapBasedOnId: util.Map[PartitionSpec, util.HashMap[String, Int]],

Review comment:
       If `partitionTaskMapBasedOnId` is not empty we can go with new flow else old flow, why other two arguments required ?




--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-847202605


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12602/job/ApacheCarbon_PR_Builder_2.4.5/3674/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-847204303


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12602/job/ApacheCarbonPRBuilder2.3/5419/
   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

Indhumathi27 commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-847581804


   @ajantha-bhat Fixed all comments. Please review


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130#issuecomment-847591738


   LGTM


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] asfgit closed pull request #4130: [CARBONDATA-4183] Local sort Partition Load and Compaction fix

GitBox
In reply to this post by GitBox

asfgit closed pull request #4130:
URL: https://github.com/apache/carbondata/pull/4130


   


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]