CarbonDataQA2 commented on pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781209057 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5492/
CarbonDataQA2 commented on pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781284062 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3730/
CarbonDataQA2 commented on pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781286159 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5494/
CarbonDataQA2 commented on pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781561697 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3327/
CarbonDataQA2 commented on pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781562499 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/5084/
akashrn5 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578921513

########## File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java ########## @@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment, Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap(); for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) { FileFactory.deleteFile(entry.getKey()); + LOGGER.info("File deleted after clean files operation: " + entry.getKey()); for (String file : entry.getValue()) { String[] deltaFilePaths = updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo()); for (String deltaFilePath : deltaFilePaths) { FileFactory.deleteFile(deltaFilePath); + LOGGER.info("File deleted after clean files operation: " + deltaFilePath);

Review comment: Instead of logging each file name for every delete, which will inflate these logs when there are many delta files, log once for all files after the loop completes, in line 1147, along with the actual block path. Alternatively, you can log that the block file was deleted (print the block file name) along with the corresponding delta files, since the timestamp will be the same, I guess. Just check once and add.

########## File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java ########## @@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment, Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap(); for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) { FileFactory.deleteFile(entry.getKey()); + LOGGER.info("File deleted after clean files operation: " + entry.getKey());

Review comment: ```suggestion LOGGER.info("Deleted file: " + entry.getKey() + ", on clean files" ); ``` Can do the same for all.

########## File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java ########## @@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy, * The below method deletes timestamp subdirectories in the trash folder which have expired as * per the user defined retention time */ - public static void deleteExpiredDataFromTrash(String tablePath) { + public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) {

Review comment: Update the method comment based on the method signature changes and return values; follow the same for the others too, if changed.

########## File path: docs/clean-files.md ########## @@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com ``` CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true') - ``` \ No newline at end of file + ``` +### DRY RUN OPTION +Clean files also support a dry run option which will let the user know how much space fill we freed +during the actual clean files operation. The dry run operation will not delete any data but will just give +size bases statistics to the data. Dry run operation will return two columns where the first will +show how much space will be freed by that clean files operation and the second column will show the +remaining stale data(data which can be deleted but has not yet expired as per the ```max.query.execution.time``` and ``` carbon.trash.retention.days``` values +). By default the value of ```dryrun``` option is ```false```.
+ +Dry Run Operation is supported with four types of commands: + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('dryrun'='true') + ``` + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('force'='true', 'dryrun'='true') + ``` + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true','dryrun'='true') + ``` + + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true','dryrun'='true') + ``` + +**NOTE**: + * Since the dry run operation will calculate size and will access File level API's, the operation can + be a costly and a time consuming operation in case of tables with large number of segments.

Review comment: Better to add the point here that when dry run is true, the other options don't matter except force = true; and I hope you have handled this in the code.

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ########## @@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden } } - private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) { + public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier + absoluteTableIdentifier) { + boolean result = false;

Review comment: rename this variable to `isExpiredSegment`

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ########## @@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden } } - private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) { + public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier + absoluteTableIdentifier) { + boolean result = false; + if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() == + SegmentStatus.MARKED_FOR_DELETE) { + return true; + } else if (oneLoad.getSegmentStatus() == SegmentStatus.INSERT_IN_PROGRESS || oneLoad + .getSegmentStatus() == SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS) { + // check if lock can be acquired + ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier, + CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK); + if (segmentLock.lockWithRetries()) { + result = true; + segmentLock.unlock();

Review comment: Use the `CarbonLockUtil.fileUnlock` method, as it also logs; please follow the other places that take a lock, and it is better to have the unlock in a finally block.

########## File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java ########## @@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable) // Deleting the stale Segment folders and the segment file.
try { CarbonUtil.deleteFoldersAndFiles(segmentPath); + LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after" + + " moving it to the trash folder"); // delete the segment file as well FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(), staleSegmentFile)); + LOGGER.info("Deleted stale segment file after moving it to the trash folder :" + + staleSegmentFile);

Review comment: ```suggestion deleted stale segment file <segment_file_name> after moving it to the trash folder ```

########## File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java ########## @@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy, * The below method deletes timestamp subdirectories in the trash folder which have expired as * per the user defined retention time */ - public static void deleteExpiredDataFromTrash(String tablePath) { + public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) { CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath .getTrashFolderPath(tablePath)); + long sizeFreed = 0; + long trashFolderSize = 0; // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp. try { if (trashFolder.isFileExist()) { + trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());

Review comment: I think you should not calculate the size when the statistics option is false; calculate it only when it is true, else it will affect clean files time. Please take care not to impact the time of clean files.

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ########## @@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden } } - private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) { + public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier + absoluteTableIdentifier) { + boolean result = false; + if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() == + SegmentStatus.MARKED_FOR_DELETE) { + return true;

Review comment: Here also, assign to the boolean variable, as it's just checking for one segment.

########## File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java ########## @@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable) // Deleting the stale Segment folders and the segment file.
try { CarbonUtil.deleteFoldersAndFiles(segmentPath); + LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after" + + " moving it to the trash folder"); // delete the segment file as well FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(), staleSegmentFile)); + LOGGER.info("Deleted stale segment file after moving it to the trash folder :" + + staleSegmentFile); for (String duplicateStaleSegmentFile : redundantSegmentFile) { if (DataFileUtil.getSegmentNoFromSegmentFile(duplicateStaleSegmentFile) .equals(segmentNumber)) { FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable .getTablePath(), duplicateStaleSegmentFile)); + LOGGER.info("Deleted redundant segment file :" + duplicateStaleSegmentFile);

Review comment: Can log all the deleted files together after the loop; also, please check all places and try to follow the same. This is just to make the logs cleaner.

########## File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java ########## @@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy, * The below method deletes timestamp subdirectories in the trash folder which have expired as * per the user defined retention time */ - public static void deleteExpiredDataFromTrash(String tablePath) { + public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) { CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath .getTrashFolderPath(tablePath)); + long sizeFreed = 0; + long trashFolderSize = 0; // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp. try { if (trashFolder.isFileExist()) { + trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath()); CarbonFile[] timestampFolderList = trashFolder.listFiles(); + List<CarbonFile> filesToDelete = new ArrayList<>(); for (CarbonFile timestampFolder : timestampFolderList) { // If the timeStamp at which the timeStamp subdirectory has expired as per the user // defined value, delete the complete timeStamp subdirectory - if (timestampFolder.isDirectory() && isTrashRetentionTimeoutExceeded(Long - .parseLong(timestampFolder.getName()))) { - FileFactory.deleteAllCarbonFilesOfDir(timestampFolder); - LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + timestampFolder + if (isTrashRetentionTimeoutExceeded(Long.parseLong(timestampFolder.getName()))) { + if (timestampFolder.isDirectory()) { + sizeFreed += FileFactory.getDirectorySize(timestampFolder.getAbsolutePath()); + filesToDelete.add(timestampFolder); + } + } + } + if (!isDryRun) { + for (CarbonFile carbonFile : filesToDelete) { + LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + carbonFile .getAbsolutePath()); + FileFactory.deleteAllCarbonFilesOfDir(carbonFile); } } } } catch (IOException e) { LOGGER.error("Error during deleting expired timestamp folder from the trash folder", e); } + return new long[] {sizeFreed, trashFolderSize - sizeFreed}; } /** * The below method deletes all the files and folders in the trash folder of a carbon table. + * Returns an array in which the first element contains the size freed in case of clean files + * operation or size that can be freed in case of dry run and the second element contains the + * remaining size.
*/ - public static void emptyTrash(String tablePath) { + public static long[] emptyTrash(String tablePath, Boolean isDryRun) { CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath .getTrashFolderPath(tablePath)); // if the trash folder exists delete the contents of the trash folder + long sizeFreed = 0; + long[] sizeStatistics = new long[]{0, 0}; try { if (trashFolder.isFileExist()) { CarbonFile[] carbonFileList = trashFolder.listFiles(); + List<CarbonFile> filesToDelete = new ArrayList<>(); for (CarbonFile carbonFile : carbonFileList) { - FileFactory.deleteAllCarbonFilesOfDir(carbonFile); + sizeFreed += FileFactory.getDirectorySize(carbonFile.getAbsolutePath());

Review comment: same as above

########## File path: docs/clean-files.md ########## @@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com ``` CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true') - ``` \ No newline at end of file + ``` +### DRY RUN OPTION +Clean files also support a dry run option which will let the user know how much space fill we freed +during the actual clean files operation. The dry run operation will not delete any data but will just give +size bases statistics to the data. Dry run operation will return two columns where the first will

Review comment: ```suggestion size based statistics on the data which will be cleaned in clean files. Dry run operation will return two columns where the first will ```

########## File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala ########## @@ -72,11 +78,26 @@ object DataTrashManager { carbonDeleteSegmentLock = CarbonLockUtil.getLockObject(carbonTable .getAbsoluteTableIdentifier, LockUsage.DELETE_SEGMENT_LOCK, deleteSegmentErrorMsg) // step 1: check and clean trash folder - checkAndCleanTrashFolder(carbonTable, isForceDelete) + // trashFolderSizeStats(0) contains the size that is freed/or can be freed and + // trashFolderSizeStats(1) contains the size of remaining data in the trash folder + val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete, + isDryRun = false) // step 2: move stale segments which are not exists in metadata into .Trash moveStaleSegmentsToTrash(carbonTable)

Review comment: If you are calculating the trash size before calling `moveStaleSegmentsToTrash`, it gives wrong info, right? Because after the size calculation, some folders are getting moved into the trash folder again?

########## File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala ########## @@ -87,13 +108,70 @@ object DataTrashManager { } } - private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = { + /** + * Checks the size of the segment files as well as datafiles, this method is used before and after + * clean files operation to check how much space is actually freed, during the operation.
+ */ + def getSizeScreenshot(carbonTable: CarbonTable): Long = { + val segmentPath = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath) + var size : Long = 0 + if (!carbonTable.isHivePartitionTable) { + if (carbonTable.getTableInfo.getFactTable.getTableProperties.containsKey( + CarbonCommonConstants.FLAT_FOLDER)) { + // the size is table size + segment folder size - (metadata folder size + lockFiles size) + (FileFactory.getDirectorySize(carbonTable.getTablePath) + FileFactory.getDirectorySize(

Review comment: Here, do not calculate the fact folder size directly, because when there are 1000s of segments it will be slow. We are already storing the data and index size in the metadata; please make use of it and it will be faster.

########## File path: integration/spark/src/main/scala/org/apache/spark/sql/secondaryindex/events/CleanFilesPostEventListener.scala ########## @@ -56,7 +56,6 @@ class CleanFilesPostEventListener extends OperationEventListener with Logging { cleanFilesPostEvent.carbonTable, cleanFilesPostEvent.options.getOrElse("force", "false").toBoolean, cleanFilesPostEvent.options.getOrElse("stale_inprogress", "false").toBoolean) -

Review comment: revert this file

########## File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala ########## @@ -112,13 +141,91 @@ object DataTrashManager { carbonTable: CarbonTable, isForceDelete: Boolean, cleanStaleInProgress: Boolean, - partitionSpecsOption: Option[Seq[PartitionSpec]]): Unit = { + partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = { val partitionSpecs = partitionSpecsOption.map(_.asJava).orNull - SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable, + val sizeStatistics = SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable, isForceDelete, partitionSpecs, cleanStaleInProgress, true) if (carbonTable.isHivePartitionTable && partitionSpecsOption.isDefined) { SegmentFileStore.cleanSegments(carbonTable, partitionSpecs, isForceDelete) } + sizeStatistics + } + + private def dryRunOnExpiredSegments( + carbonTable: CarbonTable, + isForceDelete: Boolean, + cleanStaleInProgress: Boolean, + partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = { + var sizeFreed: Long = 0 + var trashSizeRemaining: Long = 0 + val loadMetadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath) + if (SegmentStatusManager.isLoadDeletionRequired(loadMetadataDetails)) { + loadMetadataDetails.foreach { oneLoad => + val segmentFilePath = CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath, + oneLoad.getSegmentFile) + if (DeleteLoadFolders.canDeleteThisLoad(oneLoad, isForceDelete, cleanStaleInProgress)) { + // No need to consider physical data for external segments, only consider metadata.
+ if (oneLoad.getPath() == null || oneLoad.getPath().equalsIgnoreCase("NA")) { + if (!carbonTable.isHivePartitionTable) { + sizeFreed += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(carbonTable + .getTablePath, oneLoad.getLoadName)) + } else { + sizeFreed += partitionTableSegmentSize(carbonTable, oneLoad, loadMetadataDetails, + partitionSpecsOption) + } + } + sizeFreed += FileFactory.getCarbonFile(segmentFilePath).getSize + } else { + if (SegmentStatusManager.isExpiredSegment(oneLoad, carbonTable + .getAbsoluteTableIdentifier)) { + if (!carbonTable.isHivePartitionTable) { + trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath( + carbonTable.getTablePath, oneLoad.getLoadName)) + } else { + trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad, + loadMetadataDetails, partitionSpecsOption) + } + trashSizeRemaining += FileFactory.getCarbonFile(segmentFilePath).getSize + } + } + } + } + Seq(sizeFreed, trashSizeRemaining) + } + + def partitionTableSegmentSize( carbonTable: CarbonTable, oneLoad: LoadMetadataDetails, + loadMetadataDetails: Array[LoadMetadataDetails], partitionSpecsOption: + Option[Seq[PartitionSpec]]) : Long = { + var segmentSize: Long = 0

Review comment: I agree with @ajantha-bhat; maybe use the LoadMetadataDetails data and index size only, and for update files you can use some logic. Please check.
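The batched logging asked for at the top of this review (in `SegmentFileStore.deleteSegment`) could look roughly like the following sketch, built only from calls visible in the diff; the summary message wording is illustrative, not the PR's final text.

```java
// Collect the delete delta files removed for each block, then emit a single
// log line per block instead of one line per deleted file.
for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) {
  FileFactory.deleteFile(entry.getKey());
  List<String> deletedDeltaFiles = new ArrayList<>();
  for (String file : entry.getValue()) {
    String[] deltaFilePaths =
        updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo());
    for (String deltaFilePath : deltaFilePaths) {
      FileFactory.deleteFile(deltaFilePath);
      deletedDeltaFiles.add(deltaFilePath);
    }
  }
  // one summary line per block, covering the block entry and its delta files
  LOGGER.info("Clean files deleted " + entry.getKey()
      + " and its delete delta files: " + deletedDeltaFiles);
}
```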
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578962180 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala ########## @@ -87,13 +108,70 @@ object DataTrashManager { } } - private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean): Unit = { + /** + * Checks the size of the segment files as well as datafiles, this method is used before and after + * clean files operation to check how much space is actually freed, during the operation. + */ + def getSizeScreenshot(carbonTable: CarbonTable): Long = { + val segmentPath = CarbonTablePath.getSegmentFilesLocation(carbonTable.getTablePath) + var size : Long = 0 + if (!carbonTable.isHivePartitionTable) { + if (carbonTable.getTableInfo.getFactTable.getTableProperties.containsKey( + CarbonCommonConstants.FLAT_FOLDER)) { + // the size is table size + segment folder size - (metadata folder size + lockFiles size) + (FileFactory.getDirectorySize(carbonTable.getTablePath) + FileFactory.getDirectorySize( Review comment: done ########## File path: integration/spark/src/main/scala/org/apache/spark/sql/secondaryindex/events/CleanFilesPostEventListener.scala ########## @@ -56,7 +56,6 @@ class CleanFilesPostEventListener extends OperationEventListener with Logging { cleanFilesPostEvent.carbonTable, cleanFilesPostEvent.options.getOrElse("force", "false").toBoolean, cleanFilesPostEvent.options.getOrElse("stale_inprogress", "false").toBoolean) - Review comment: done
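For context on the comment about using the stored sizes, a hedged sketch of the idea: read the data and index sizes that CarbonData already records per load in the table status file instead of walking the segment directory, which is slow for tables with thousands of segments. The helper name `segmentSizeFromMetadata` is hypothetical; `getDataSize()` and `getIndexSize()` are assumed to return the string-encoded byte counts kept in `LoadMetadataDetails`.

```java
import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;

public final class SegmentSizeUtil {
  private SegmentSizeUtil() { }

  // Sum the recorded data and index sizes for one load; fall back to 0 when a
  // size was never written (e.g. for very old segments).
  public static long segmentSizeFromMetadata(LoadMetadataDetails load) {
    long size = 0;
    if (load.getDataSize() != null) {
      size += Long.parseLong(load.getDataSize());
    }
    if (load.getIndexSize() != null) {
      size += Long.parseLong(load.getIndexSize());
    }
    return size;
  }
}
```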
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578963097 ########## File path: docs/clean-files.md ########## @@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com ``` CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true') - ``` \ No newline at end of file + ``` +### DRY RUN OPTION +Clean files also support a dry run option which will let the user know how much space fill we freed +during the actual clean files operation. The dry run operation will not delete any data but will just give +size bases statistics to the data. Dry run operation will return two columns where the first will +show how much space will be freed by that clean files operation and the second column will show the +remaining stale data(data which can be deleted but has not yet expired as per the ```max.query.execution.time``` and ``` carbon.trash.retention.days``` values +). By default the value of ```dryrun``` option is ```false```. + +Dry Run Operation is supported with four types of commands: + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('dryrun'='true') + ``` + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('force'='true', 'dryrun'='true') + ``` + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true','dryrun'='true') + ``` + + ``` + CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true','dryrun'='true') + ``` + +**NOTE**: + * Since the dry run operation will calculate size and will access File level API's, the operation can + be a costly and a time consuming operation in case of tables with large number of segments. Review comment: done ########## File path: docs/clean-files.md ########## @@ -64,4 +64,40 @@ The stale_inprogress option with force option will delete Marked for delete, Com ``` CLEAN FILES FOR TABLE TABLE_NAME options('stale_inprogress'='true', 'force'='true') - ``` \ No newline at end of file + ``` +### DRY RUN OPTION +Clean files also support a dry run option which will let the user know how much space fill we freed +during the actual clean files operation. The dry run operation will not delete any data but will just give +size bases statistics to the data. Dry run operation will return two columns where the first will Review comment: done
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578967247 ########## File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java ########## @@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy, * The below method deletes timestamp subdirectories in the trash folder which have expired as * per the user defined retention time */ - public static void deleteExpiredDataFromTrash(String tablePath) { + public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) { CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath .getTrashFolderPath(tablePath)); + long sizeFreed = 0; + long trashFolderSize = 0; // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp. try { if (trashFolder.isFileExist()) { + trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath()); Review comment: done
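Pulling the resolved comments together, a rough sketch of the shape the trash cleanup converges on: expired timestamp folders are collected first, directory sizes are computed only when statistics are wanted, and deletion is skipped on a dry run. The `showStats` flag is a hypothetical stand-in for however the final code gates the size calculation; the other calls appear in the diffs above, and `isTrashRetentionTimeoutExceeded` is the existing private helper they reference.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.carbondata.core.datastore.filesystem.CarbonFile;
import org.apache.carbondata.core.datastore.impl.FileFactory;

// Hedged sketch, not the PR's final code.
public static long[] deleteExpiredDataFromTrash(CarbonFile trashFolder,
    boolean isDryRun, boolean showStats) throws IOException {
  long sizeFreed = 0;
  long trashFolderSize = 0;
  if (showStats) {
    // only pay for the directory walk when statistics were requested
    trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath());
  }
  List<CarbonFile> foldersToDelete = new ArrayList<>();
  for (CarbonFile timestampFolder : trashFolder.listFiles()) {
    if (timestampFolder.isDirectory()
        && isTrashRetentionTimeoutExceeded(Long.parseLong(timestampFolder.getName()))) {
      if (showStats) {
        sizeFreed += FileFactory.getDirectorySize(timestampFolder.getAbsolutePath());
      }
      foldersToDelete.add(timestampFolder);
    }
  }
  if (!isDryRun) {
    for (CarbonFile folder : foldersToDelete) {
      FileFactory.deleteAllCarbonFilesOfDir(folder);
    }
  }
  // {size freed (or freeable, on dry run), size left behind in the trash}
  return new long[] {sizeFreed, trashFolderSize - sizeFreed};
}
```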
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578967973 ########## File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java ########## @@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy, * The below method deletes timestamp subdirectories in the trash folder which have expired as * per the user defined retention time */ - public static void deleteExpiredDataFromTrash(String tablePath) { + public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) { Review comment: done
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578974314 ########## File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java ########## @@ -152,46 +153,77 @@ public static void copyFilesToTrash(List<String> filesToCopy, * The below method deletes timestamp subdirectories in the trash folder which have expired as * per the user defined retention time */ - public static void deleteExpiredDataFromTrash(String tablePath) { + public static long[] deleteExpiredDataFromTrash(String tablePath, Boolean isDryRun) { CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath .getTrashFolderPath(tablePath)); + long sizeFreed = 0; + long trashFolderSize = 0; // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp. try { if (trashFolder.isFileExist()) { + trashFolderSize = FileFactory.getDirectorySize(trashFolder.getAbsolutePath()); CarbonFile[] timestampFolderList = trashFolder.listFiles(); + List<CarbonFile> filesToDelete = new ArrayList<>(); for (CarbonFile timestampFolder : timestampFolderList) { // If the timeStamp at which the timeStamp subdirectory has expired as per the user // defined value, delete the complete timeStamp subdirectory - if (timestampFolder.isDirectory() && isTrashRetentionTimeoutExceeded(Long - .parseLong(timestampFolder.getName()))) { - FileFactory.deleteAllCarbonFilesOfDir(timestampFolder); - LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + timestampFolder + if (isTrashRetentionTimeoutExceeded(Long.parseLong(timestampFolder.getName()))) { + if (timestampFolder.isDirectory()) { + sizeFreed += FileFactory.getDirectorySize(timestampFolder.getAbsolutePath()); + filesToDelete.add(timestampFolder); + } + } + } + if (!isDryRun) { + for (CarbonFile carbonFile : filesToDelete) { + LOGGER.info("Timestamp subfolder from the Trash folder deleted: " + carbonFile .getAbsolutePath()); + FileFactory.deleteAllCarbonFilesOfDir(carbonFile); } } } } catch (IOException e) { LOGGER.error("Error during deleting expired timestamp folder from the trash folder", e); } + return new long[] {sizeFreed, trashFolderSize - sizeFreed}; } /** * The below method deletes all the files and folders in the trash folder of a carbon table. + * Returns an array in which the first element contains the size freed in case of clean files + * operation or size that can be freed in case of dry run and the second element contains the + * remaining size. */ - public static void emptyTrash(String tablePath) { + public static long[] emptyTrash(String tablePath, Boolean isDryRun) { CarbonFile trashFolder = FileFactory.getCarbonFile(CarbonTablePath .getTrashFolderPath(tablePath)); // if the trash folder exists delete the contents of the trash folder + long sizeFreed = 0; + long[] sizeStatistics = new long[]{0, 0}; try { if (trashFolder.isFileExist()) { CarbonFile[] carbonFileList = trashFolder.listFiles(); + List<CarbonFile> filesToDelete = new ArrayList<>(); for (CarbonFile carbonFile : carbonFileList) { - FileFactory.deleteAllCarbonFilesOfDir(carbonFile); + sizeFreed += FileFactory.getDirectorySize(carbonFile.getAbsolutePath()); Review comment: done
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578981989 ########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ########## @@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden } } - private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) { + public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier + absoluteTableIdentifier) { + boolean result = false; + if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() == + SegmentStatus.MARKED_FOR_DELETE) { + return true; + } else if (oneLoad.getSegmentStatus() == SegmentStatus.INSERT_IN_PROGRESS || oneLoad + .getSegmentStatus() == SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS) { + // check if lock can be acquired + ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier, + CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK); + if (segmentLock.lockWithRetries()) { + result = true; + segmentLock.unlock(); Review comment: done
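For illustration, a small sketch of the lock handling the reviewer asked for, assuming `CarbonLockUtil.fileUnlock` behaves as at other CarbonData call sites (it releases the lock and logs the release) and moving the unlock into a finally block; the method name is hypothetical.

```java
import org.apache.carbondata.core.locks.CarbonLockFactory;
import org.apache.carbondata.core.locks.CarbonLockUtil;
import org.apache.carbondata.core.locks.ICarbonLock;
import org.apache.carbondata.core.locks.LockUsage;
import org.apache.carbondata.core.metadata.AbsoluteTableIdentifier;
import org.apache.carbondata.core.statusmanager.LoadMetadataDetails;
import org.apache.carbondata.core.util.path.CarbonTablePath;

// Sketch: an in-progress segment counts as expired only if its segment lock
// can be acquired, i.e. no loader is still holding it.
public static boolean isInProgressSegmentExpired(LoadMetadataDetails oneLoad,
    AbsoluteTableIdentifier absoluteTableIdentifier) {
  ICarbonLock segmentLock = CarbonLockFactory.getCarbonLockObj(absoluteTableIdentifier,
      CarbonTablePath.addSegmentPrefix(oneLoad.getLoadName()) + LockUsage.LOCK);
  boolean expired = false;
  try {
    expired = segmentLock.lockWithRetries();
  } finally {
    if (expired) {
      // fileUnlock releases the lock and logs it, matching other call sites
      CarbonLockUtil.fileUnlock(segmentLock, LockUsage.LOCK);
    }
  }
  return expired;
}
```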
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578982092 ########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ########## @@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden } } - private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) { + public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier + absoluteTableIdentifier) { + boolean result = false; + if (oneLoad.getSegmentStatus() == SegmentStatus.COMPACTED || oneLoad.getSegmentStatus() == + SegmentStatus.MARKED_FOR_DELETE) { + return true; Review comment: done
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578982527 ########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ########## @@ -983,7 +983,28 @@ public static boolean isLoadInProgress(AbsoluteTableIdentifier absoluteTableIden } } - private static boolean isLoadDeletionRequired(LoadMetadataDetails[] details) { + public static boolean isExpiredSegment(LoadMetadataDetails oneLoad, AbsoluteTableIdentifier + absoluteTableIdentifier) { + boolean result = false; Review comment: done
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578982799 ########## File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java ########## @@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment, Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap(); for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) { FileFactory.deleteFile(entry.getKey()); + LOGGER.info("File deleted after clean files operation: " + entry.getKey()); Review comment: done
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578987875 ########## File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java ########## @@ -1135,13 +1135,16 @@ public static void deleteSegment(String tablePath, Segment segment, Map<String, List<String>> indexFilesMap = fileStore.getIndexFilesMap(); for (Map.Entry<String, List<String>> entry : indexFilesMap.entrySet()) { FileFactory.deleteFile(entry.getKey()); + LOGGER.info("File deleted after clean files operation: " + entry.getKey()); for (String file : entry.getValue()) { String[] deltaFilePaths = updateStatusManager.getDeleteDeltaFilePath(file, segment.getSegmentNo()); for (String deltaFilePath : deltaFilePaths) { FileFactory.deleteFile(deltaFilePath); + LOGGER.info("File deleted after clean files operation: " + deltaFilePath); Review comment: done
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578990010 ########## File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java ########## @@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable) // Deleting the stale Segment folders and the segment file. try { CarbonUtil.deleteFoldersAndFiles(segmentPath); + LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after" + + " moving it to the trash folder"); // delete the segment file as well FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(), staleSegmentFile)); + LOGGER.info("Deleted stale segment file after moving it to the trash folder :" + + staleSegmentFile); Review comment: Done. Actually, the segment file will not be moved to the trash; it will just be deleted straight away. Handled the same. ########## File path: core/src/main/java/org/apache/carbondata/core/util/CleanFilesUtil.java ########## @@ -71,14 +71,19 @@ public static void cleanStaleSegments(CarbonTable carbonTable) // Deleting the stale Segment folders and the segment file. try { CarbonUtil.deleteFoldersAndFiles(segmentPath); + LOGGER.info("Deleted the segment folder :" + segmentPath.getAbsolutePath() + " after" + + " moving it to the trash folder"); // delete the segment file as well FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath(), staleSegmentFile)); + LOGGER.info("Deleted stale segment file after moving it to the trash folder :" + + staleSegmentFile); for (String duplicateStaleSegmentFile : redundantSegmentFile) { if (DataFileUtil.getSegmentNoFromSegmentFile(duplicateStaleSegmentFile) .equals(segmentNumber)) { FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(carbonTable .getTablePath(), duplicateStaleSegmentFile)); + LOGGER.info("Deleted redundant segment file :" + duplicateStaleSegmentFile); Review comment: done
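A minimal sketch of the same "log once after the loop" pattern applied to the redundant segment files above, using only calls visible in the diff; the summary wording is illustrative.

```java
// Gather the redundant segment files that were deleted and log them in one line.
List<String> deletedFiles = new ArrayList<>();
for (String duplicateStaleSegmentFile : redundantSegmentFile) {
  if (DataFileUtil.getSegmentNoFromSegmentFile(duplicateStaleSegmentFile)
      .equals(segmentNumber)) {
    FileFactory.deleteFile(CarbonTablePath.getSegmentFilePath(
        carbonTable.getTablePath(), duplicateStaleSegmentFile));
    deletedFiles.add(duplicateStaleSegmentFile);
  }
}
if (!deletedFiles.isEmpty()) {
  LOGGER.info("Deleted redundant segment files: " + String.join(", ", deletedFiles));
}
```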
vikramahuja1001 commented on a change in pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#discussion_r578993995 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala ########## @@ -72,11 +78,26 @@ object DataTrashManager { carbonDeleteSegmentLock = CarbonLockUtil.getLockObject(carbonTable .getAbsoluteTableIdentifier, LockUsage.DELETE_SEGMENT_LOCK, deleteSegmentErrorMsg) // step 1: check and clean trash folder - checkAndCleanTrashFolder(carbonTable, isForceDelete) + // trashFolderSizeStats(0) contains the size that is freed/or can be freed and + // trashFolderSizeStats(1) contains the size of remaining data in the trash folder + val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete, + isDryRun = false) // step 2: move stale segments which are not exists in metadata into .Trash moveStaleSegmentsToTrash(carbonTable) Review comment: No, it won't matter, because we need to show how much space can be cleared from the trash. Data moved from the segment folder to the trash folder will only be cleaned by the next clean files command, so we don't need to include it in these stats.
CarbonDataQA2 commented on pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781958308 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/5089/
CarbonDataQA2 commented on pull request #4072: URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781959981 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/3330/