[GitHub] [carbondata] vikramahuja1001 opened a new pull request #4072: [WIP] Clean files phase2

classic Classic list List threaded Threaded
180 messages Options
1 ... 56789
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox

akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590016222



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -87,13 +106,53 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeSnapshot(carbonTable: CarbonTable): Long = {
+    val metadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    var size: Long = 0
+    val segmentFileLocation = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+    if (FileFactory.isFileExist(segmentFileLocation)) {
+      size += FileFactory.getDirectorySize(segmentFileLocation)
+    }
+    metadataDetails.foreach(oneLoad =>
+      if (oneLoad.getVisibility.toBoolean) {
+        size += calculateSegmentSizeForOneLoad(carbonTable, oneLoad, metadataDetails)
+      }
+    )
+    size
+  }
+
+  /**
+   * Method to handle the Clean files dry run operation
+   */
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      showStats: Boolean): (Long, Long) = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+        isDryRun = true, showStats)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress)
+    (trashFolderSizeStats._1 + expiredSegmentsSizeStats._1, trashFolderSizeStats._2 +
+        expiredSegmentsSizeStats._2)
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean, showStats: Boolean): (Long, Long) = {
     if (isForceDelete) {
       // empty the trash folder
-      TrashUtil.emptyTrash(carbonTable.getTablePath)
+      val a = TrashUtil.emptyTrash(carbonTable.getTablePath, isDryRun, showStats)

Review comment:
       give a proper variable name here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] akashrn5 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590016320



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -87,13 +106,53 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeSnapshot(carbonTable: CarbonTable): Long = {
+    val metadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    var size: Long = 0
+    val segmentFileLocation = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+    if (FileFactory.isFileExist(segmentFileLocation)) {
+      size += FileFactory.getDirectorySize(segmentFileLocation)
+    }
+    metadataDetails.foreach(oneLoad =>
+      if (oneLoad.getVisibility.toBoolean) {
+        size += calculateSegmentSizeForOneLoad(carbonTable, oneLoad, metadataDetails)
+      }
+    )
+    size
+  }
+
+  /**
+   * Method to handle the Clean files dry run operation
+   */
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      showStats: Boolean): (Long, Long) = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+        isDryRun = true, showStats)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress)
+    (trashFolderSizeStats._1 + expiredSegmentsSizeStats._1, trashFolderSizeStats._2 +
+        expiredSegmentsSizeStats._2)
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean, showStats: Boolean): (Long, Long) = {
     if (isForceDelete) {
       // empty the trash folder
-      TrashUtil.emptyTrash(carbonTable.getTablePath)
+      val a = TrashUtil.emptyTrash(carbonTable.getTablePath, isDryRun, showStats)
+      (a.head, a(1))
     } else {
       // clear trash based on timestamp
-      TrashUtil.deleteExpiredDataFromTrash(carbonTable.getTablePath)
+      val a = TrashUtil.deleteExpiredDataFromTrash(carbonTable.getTablePath, isDryRun, showStats)

Review comment:
       same as above




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-793536218


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/5120/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-793543867


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3361/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590142619



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -87,13 +106,53 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeSnapshot(carbonTable: CarbonTable): Long = {
+    val metadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    var size: Long = 0
+    val segmentFileLocation = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+    if (FileFactory.isFileExist(segmentFileLocation)) {
+      size += FileFactory.getDirectorySize(segmentFileLocation)
+    }
+    metadataDetails.foreach(oneLoad =>
+      if (oneLoad.getVisibility.toBoolean) {
+        size += calculateSegmentSizeForOneLoad(carbonTable, oneLoad, metadataDetails)
+      }
+    )
+    size
+  }
+
+  /**
+   * Method to handle the Clean files dry run operation
+   */
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      showStats: Boolean): (Long, Long) = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+        isDryRun = true, showStats)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress)
+    (trashFolderSizeStats._1 + expiredSegmentsSizeStats._1, trashFolderSizeStats._2 +
+        expiredSegmentsSizeStats._2)
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean, showStats: Boolean): (Long, Long) = {
     if (isForceDelete) {
       // empty the trash folder
-      TrashUtil.emptyTrash(carbonTable.getTablePath)
+      val a = TrashUtil.emptyTrash(carbonTable.getTablePath, isDryRun, showStats)

Review comment:
       done

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -87,13 +106,53 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  /**
+   * Checks the size of the segment files as well as datafiles, this method is used before and after
+   * clean files operation to check how much space is actually freed, during the operation.
+   */
+  def getSizeSnapshot(carbonTable: CarbonTable): Long = {
+    val metadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    var size: Long = 0
+    val segmentFileLocation = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath)
+    if (FileFactory.isFileExist(segmentFileLocation)) {
+      size += FileFactory.getDirectorySize(segmentFileLocation)
+    }
+    metadataDetails.foreach(oneLoad =>
+      if (oneLoad.getVisibility.toBoolean) {
+        size += calculateSegmentSizeForOneLoad(carbonTable, oneLoad, metadataDetails)
+      }
+    )
+    size
+  }
+
+  /**
+   * Method to handle the Clean files dry run operation
+   */
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      showStats: Boolean): (Long, Long) = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete,
+        isDryRun = true, showStats)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress)
+    (trashFolderSizeStats._1 + expiredSegmentsSizeStats._1, trashFolderSizeStats._2 +
+        expiredSegmentsSizeStats._2)
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean, showStats: Boolean): (Long, Long) = {
     if (isForceDelete) {
       // empty the trash folder
-      TrashUtil.emptyTrash(carbonTable.getTablePath)
+      val a = TrashUtil.emptyTrash(carbonTable.getTablePath, isDryRun, showStats)
+      (a.head, a(1))
     } else {
       // clear trash based on timestamp
-      TrashUtil.deleteExpiredDataFromTrash(carbonTable.getTablePath)
+      val a = TrashUtil.deleteExpiredDataFromTrash(carbonTable.getTablePath, isDryRun, showStats)

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590182511



##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/impl/FileFactory.java
##########
@@ -297,6 +297,10 @@ public static boolean deleteFile(String filePath) throws IOException {
     return getCarbonFile(filePath).deleteFile();
   }
 
+  public static boolean deleteFile(CarbonFile carbonFile) throws IOException {

Review comment:
       removed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590183560



##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -1125,13 +1165,32 @@ public static void deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, boolean
             CarbonLockUtil.fileUnlock(carbonTableStatusLock, LockUsage.TABLE_STATUS_LOCK);
           }
           if (updateCompletionStatus) {
-            DeleteLoadFolders
+            long[] cleanFileSizeFreed = DeleteLoadFolders
                 .physicalFactAndMeasureMetadataDeletion(carbonTable, newAddedLoadHistoryList,
                   isForceDeletion, partitionSpecs, cleanStaleInprogress);
+            sizeFreed += cleanFileSizeFreed[0];
+            trashSizeRemaining += cleanFileSizeFreed[1];
+          }
+        }
+      } else {
+        try {
+          for (LoadMetadataDetails oneLoad : metadataDetails) {
+            if (isExpiredSegment(oneLoad, carbonTable.getAbsoluteTableIdentifier())) {
+              if (!carbonTable.isHivePartitionTable()) {
+                trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(
+                  carbonTable.getTablePath(), oneLoad.getLoadName()));
+              } else {
+                trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                  metadataDetails, partitionSpecs);
+              }
+            }
           }
+        } catch (Exception e) {
+          LOG.error("Unable to calculate size of garbage data", e);
         }
       }
     }
+    return new long[]{sizeFreed, trashSizeRemaining};

Review comment:
       changed logic, removed dry run flow with current clean files flow and made a different flow to handle that part




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590184637



##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -1297,4 +1356,37 @@ public static TableStatusReturnTuple separateVisibleAndInvisibleSegments(
       return new HashMap<>(0);
     }
   }
+
+  public static long partitionTableSegmentSize(CarbonTable carbonTable, LoadMetadataDetails
+      oneLoad, LoadMetadataDetails[] loadMetadataDetails, List<PartitionSpec>
+      partitionSpecs) throws Exception {
+    long size = 0;
+    SegmentFileStore fileStore = new SegmentFileStore(carbonTable.getTablePath(), oneLoad
+        .getSegmentFile());
+    List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
+        FileFactory.getConfiguration());
+    Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();

Review comment:
       instead of this, using the metadata to read the data and the index size, it is a very fast operation and will not require any file reading other than the tablestatus file.

##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFilesCommandPartitionTable.scala
##########
@@ -65,14 +65,19 @@ class TestCleanFilesCommandPartitionTable extends QueryTest with BeforeAndAfterA
     loadData()
     sql(s"""ALTER TABLE CLEANTEST COMPACT "MINOR" """)
     loadData()
+    sql(s"CLEAN FILES FOR TABLE cleantest DRYRUN").show()
+    sql(s"CLEAN FILES FOR TABLE cleantest").show()

Review comment:
       changed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590184901



##########
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##########
@@ -1112,11 +1112,15 @@ public static void cleanSegments(CarbonTable table,
     cleanSegments(table, details, partitionSpecs, forceDelete);
   }
 
-  public static void deleteSegmentFile(String tablePath, Segment segment) throws Exception {
+  public static long deleteSegmentFile(String tablePath, Segment segment) throws Exception {
     String segmentFilePath =
         CarbonTablePath.getSegmentFilePath(tablePath, segment.getSegmentFileName());
     // Deletes the physical segment file
-    FileFactory.deleteFile(segmentFilePath);
+    CarbonFile carbonSegmentFile = FileFactory.getCarbonFile(segmentFilePath);
+    long sizeFreed  = carbonSegmentFile.getSize();
+    FileFactory.deleteFile(carbonSegmentFile);

Review comment:
       changed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590185083



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/events/CleanFilesEvents.scala
##########
@@ -26,5 +26,6 @@ case class CleanFilesPreEvent(carbonTable: CarbonTable, sparkSession: SparkSessi
 case class CleanFilesPostEvent(
     carbonTable: CarbonTable,
     sparkSession: SparkSession,
-    options: Map[String, String])
+    options: Map[String, String],
+    dryRun: Boolean)

Review comment:
       changed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590185386



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -112,13 +141,91 @@ object DataTrashManager {
       carbonTable: CarbonTable,
       isForceDelete: Boolean,
       cleanStaleInProgress: Boolean,
-      partitionSpecsOption: Option[Seq[PartitionSpec]]): Unit = {
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
     val partitionSpecs = partitionSpecsOption.map(_.asJava).orNull
-    SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
+    val sizeStatistics = SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
       isForceDelete, partitionSpecs, cleanStaleInProgress, true)
     if (carbonTable.isHivePartitionTable && partitionSpecsOption.isDefined) {
       SegmentFileStore.cleanSegments(carbonTable, partitionSpecs, isForceDelete)
     }
+    sizeStatistics
+  }
+
+  private def dryRunOnExpiredSegments(
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
+    var sizeFreed: Long = 0
+    var trashSizeRemaining: Long = 0
+    val loadMetadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    if (SegmentStatusManager.isLoadDeletionRequired(loadMetadataDetails)) {
+      loadMetadataDetails.foreach { oneLoad =>
+        val segmentFilePath = CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath,
+          oneLoad.getSegmentFile)
+        if (DeleteLoadFolders.canDeleteThisLoad(oneLoad, isForceDelete, cleanStaleInProgress)) {
+          // No need to consider physical data for external segments, only consider metadata.
+          if (oneLoad.getPath() == null || oneLoad.getPath().equalsIgnoreCase("NA")) {
+            if (!carbonTable.isHivePartitionTable) {
+              sizeFreed += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(carbonTable
+                  .getTablePath, oneLoad.getLoadName))
+            } else {
+              sizeFreed += partitionTableSegmentSize(carbonTable, oneLoad, loadMetadataDetails,
+                partitionSpecsOption)
+            }
+          }
+          sizeFreed += FileFactory.getCarbonFile(segmentFilePath).getSize
+        } else {
+          if (SegmentStatusManager.isExpiredSegment(oneLoad, carbonTable
+              .getAbsoluteTableIdentifier)) {
+            if (!carbonTable.isHivePartitionTable) {
+              trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(
+                  carbonTable.getTablePath, oneLoad.getLoadName))
+            } else {
+              trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                  loadMetadataDetails, partitionSpecsOption)
+            }
+            trashSizeRemaining += FileFactory.getCarbonFile(segmentFilePath).getSize
+          }
+        }
+      }
+    }
+    Seq(sizeFreed, trashSizeRemaining)
+  }
+
+  def partitionTableSegmentSize( carbonTable: CarbonTable, oneLoad: LoadMetadataDetails,
+      loadMetadataDetails: Array[LoadMetadataDetails], partitionSpecsOption:
+      Option[Seq[PartitionSpec]]) : Long = {
+    var segmentSize: Long = 0

Review comment:
       changed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590186070



##########
File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFilesCommandPartitionTable.scala
##########
@@ -186,7 +193,9 @@ class TestCleanFilesCommandPartitionTable extends QueryTest with BeforeAndAfterA
     removeSegmentEntryFromTableStatusFile(CarbonEnv.getCarbonTable(Some("default"), "cleantest")(
       sqlContext.sparkSession), "2")
 
-    sql(s"CLEAN FILES FOR TABLE CLEANTEST").show()
+    val df1 = sql(s"CLEAN FILES FOR TABLE CLEANTEST DRYRUN")

Review comment:
       added code to check the segment size in one of the test case and checking it with clean files result and the dry run result. The test case will only pass when all three result are the same




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590187262



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -87,13 +101,28 @@ object DataTrashManager {
     }
   }
 
-  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = {
+  def cleanFilesDryRunOperation (
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      partitionSpecs: Option[Seq[PartitionSpec]] = None): Seq[Long] = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete, isDryRun = true)
+    // get size that will be deleted (MFD, COmpacted, Inprogress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress, partitionSpecs)
+    Seq(trashFolderSizeStats.head + expiredSegmentsSizeStats.head, trashFolderSizeStats(1) +
+        expiredSegmentsSizeStats(1))
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean): Seq[Long] = {

Review comment:
       dry run option can be ran along force option, in the force option we will have to delete all the data of the trash folder immediately. So, it will just return the complete trash size in case of dryrun = true and forceDelete = true, without deleting anything




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590189418



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/parser/CarbonSpark2SqlParser.scala
##########
@@ -513,12 +513,13 @@ class CarbonSpark2SqlParser extends CarbonDDLSqlParser {
 
   protected lazy val cleanFiles: Parser[LogicalPlan] =
     CLEAN ~> FILES ~> FOR ~> TABLE ~> (ident <~ ".").? ~ ident ~
-      (OPTIONS ~> "(" ~> repsep(options, ",") <~ ")").? <~ opt(";") ^^ {
-      case databaseName ~ tableName ~ optionList =>
+      (OPTIONS ~> "(" ~> repsep(options, ",") <~ ")").? ~ opt(DRYRUN) <~ opt(";") ^^ {

Review comment:
       changed, added it in the option




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590191260



##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -1072,7 +1097,22 @@ public static void deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, boolean
                 isUpdateRequired(isForceDeletion, carbonTable,
                     identifier, details, cleanStaleInprogress);
             if (!tuple2.isUpdateRequired) {
-              return;
+              try {
+                for (LoadMetadataDetails oneLoad : details) {
+                  if (isExpiredSegment(oneLoad, carbonTable.getAbsoluteTableIdentifier())) {
+                    if (!carbonTable.isHivePartitionTable()) {
+                      trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath
+                        .getSegmentPath(carbonTable.getTablePath(), oneLoad.getLoadName()));
+                    } else {
+                      trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                        details, partitionSpecs);
+                    }
+                  }
+                }
+              } catch (Exception e) {
+                LOG.error("Unable to calculate size of garbage data", e);
+              }
+              return new long[]{sizeFreed, trashSizeRemaining};

Review comment:
       changed code, dry run is separate from the current clean files code and not mixed with the existing code




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590192283



##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java
##########
@@ -1297,4 +1359,37 @@ public static TableStatusReturnTuple separateVisibleAndInvisibleSegments(
       return new HashMap<>(0);
     }
   }
+
+  public static long partitionTableSegmentSize(CarbonTable carbonTable, LoadMetadataDetails

Review comment:
       done, changed this method completely, now reading the data using tablestatus by reading the data and the index size




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590192598



##########
File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala
##########
@@ -112,13 +141,91 @@ object DataTrashManager {
       carbonTable: CarbonTable,
       isForceDelete: Boolean,
       cleanStaleInProgress: Boolean,
-      partitionSpecsOption: Option[Seq[PartitionSpec]]): Unit = {
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
     val partitionSpecs = partitionSpecsOption.map(_.asJava).orNull
-    SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
+    val sizeStatistics = SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
       isForceDelete, partitionSpecs, cleanStaleInProgress, true)
     if (carbonTable.isHivePartitionTable && partitionSpecsOption.isDefined) {
       SegmentFileStore.cleanSegments(carbonTable, partitionSpecs, isForceDelete)
     }
+    sizeStatistics
+  }
+
+  private def dryRunOnExpiredSegments(
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
+    var sizeFreed: Long = 0
+    var trashSizeRemaining: Long = 0
+    val loadMetadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    if (SegmentStatusManager.isLoadDeletionRequired(loadMetadataDetails)) {
+      loadMetadataDetails.foreach { oneLoad =>
+        val segmentFilePath = CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath,
+          oneLoad.getSegmentFile)
+        if (DeleteLoadFolders.canDeleteThisLoad(oneLoad, isForceDelete, cleanStaleInProgress)) {
+          // No need to consider physical data for external segments, only consider metadata.
+          if (oneLoad.getPath() == null || oneLoad.getPath().equalsIgnoreCase("NA")) {
+            if (!carbonTable.isHivePartitionTable) {
+              sizeFreed += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(carbonTable
+                  .getTablePath, oneLoad.getLoadName))
+            } else {
+              sizeFreed += partitionTableSegmentSize(carbonTable, oneLoad, loadMetadataDetails,
+                partitionSpecsOption)
+            }
+          }
+          sizeFreed += FileFactory.getCarbonFile(segmentFilePath).getSize
+        } else {
+          if (SegmentStatusManager.isExpiredSegment(oneLoad, carbonTable
+              .getAbsoluteTableIdentifier)) {
+            if (!carbonTable.isHivePartitionTable) {
+              trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(

Review comment:
       changed logic




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590203265



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala
##########
@@ -37,11 +38,24 @@ case class CarbonCleanFilesCommand(
     databaseNameOp: Option[String],
     tableName: String,
     options: Map[String, String] = Map.empty,
+    dryRun: Boolean,
     isInternalCleanCall: Boolean = false)
   extends DataCommand {
 
   val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)
 
+  override def output: Seq[AttributeReference] = {
+    if (dryRun) {
+      Seq(
+        AttributeReference("Size Freed", LongType, nullable = false)(),
+        AttributeReference("Trash Data Remaining", LongType, nullable = false)())
+    } else {
+      Seq(
+        AttributeReference("Size Freed", LongType, nullable = false)(),
+        AttributeReference("Trash Data Remaining", LongType, nullable = false)())
+    }

Review comment:
       Changed the logic, now both the blocks are different




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-793734908


   >  i agree with @ajantha-bhat , may be use loadmetadata details data and index size only and for update files you can use some logic, please check
   
   @akashrn5 , the tablestatus file is used to check the data size and the index size of each segment and in the case of update/delete checking the getUpdateDeltaStartTimestamp and getUpdateDeltaEndTimestamp using segmentUpdateStatusManager to getDeleteDeltaFIlesList and getting their sizes


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4072: [CARBONDATA-4110] Support clean files dry run operation and show statistics after clean files operation

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r590246840



##########
File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java
##########
@@ -1133,17 +1133,23 @@ public static void deleteSegment(String tablePath, Segment segment,
     List<String> indexOrMergeFiles = fileStore.readIndexFiles(SegmentStatus.SUCCESS, true,
         FileFactory.getConfiguration());
     Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap();
+    List<String> deletedFiles = new ArrayList<>();

Review comment:
       We need  to print every file when it is deleted so as to check any unnecessary deletion does not happen, the whole purpose is defeated if we delete a file and do not log it's name




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


1 ... 56789