vikramahuja1001 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r576601866

########## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala ##########

@@ -37,11 +38,24 @@ case class CarbonCleanFilesCommand(
     databaseNameOp: Option[String],
     tableName: String,
     options: Map[String, String] = Map.empty,
+    dryRun: Boolean,
     isInternalCleanCall: Boolean = false)
   extends DataCommand {

   val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)

+  override def output: Seq[AttributeReference] = {
+    if (dryRun) {
+      Seq(
+        AttributeReference("Size that will be Freed", LongType, nullable = false)(),

Review comment: done

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email]
CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-779698201

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3708/
CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-779699134

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5472/
ajantha-bhat commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577387770

########## File path: core/src/main/java/org/apache/carbondata/core/datastore/impl/FileFactory.java ##########

@@ -297,6 +297,10 @@ public static boolean deleteFile(String filePath) throws IOException {
     return getCarbonFile(filePath).deleteFile();
   }

+  public static boolean deleteFile(CarbonFile carbonFile) throws IOException {

Review comment: Please remove `deleteFile(CarbonFile carbonFile)`, because no extra functionality is added here; directly call carbonFile.deleteFile() in the usages. If there were some extra functionality, it would make sense to have the API for CarbonFile as well, to avoid getting the CarbonFile again; for this case it is not needed.
ajantha-bhat commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577388524

########## File path: core/src/main/java/org/apache/carbondata/core/metadata/SegmentFileStore.java ##########

@@ -1112,11 +1112,15 @@ public static void cleanSegments(CarbonTable table,
     cleanSegments(table, details, partitionSpecs, forceDelete);
   }

-  public static void deleteSegmentFile(String tablePath, Segment segment) throws Exception {
+  public static long deleteSegmentFile(String tablePath, Segment segment) throws Exception {
     String segmentFilePath =
         CarbonTablePath.getSegmentFilePath(tablePath, segment.getSegmentFileName());
     // Deletes the physical segment file
-    FileFactory.deleteFile(segmentFilePath);
+    CarbonFile carbonSegmentFile = FileFactory.getCarbonFile(segmentFilePath);
+    long sizeFreed = carbonSegmentFile.getSize();
+    FileFactory.deleteFile(carbonSegmentFile);

Review comment: As per the above comment, you can directly call carbonSegmentFile.deleteFile().
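The refactor under review, returning the number of bytes a delete frees, reduces to a small pattern that can be sketched in plain Java. This is an illustrative stand-in using `java.nio.file`, not CarbonData's `CarbonFile`/`FileFactory` API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeleteWithSize {

    // Capture the file's size before deleting it, so the caller can report
    // how many bytes the delete freed. Returns 0 for a missing file.
    static long deleteAndReportSize(Path file) throws IOException {
        if (!Files.exists(file)) {
            return 0L;
        }
        long sizeFreed = Files.size(file); // must be read before the delete
        Files.delete(file);                // direct delete, no pass-through wrapper
        return sizeFreed;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("segment", ".carbondata");
        Files.write(tmp, new byte[2048]);
        System.out.println(deleteAndReportSize(tmp)); // prints 2048
    }
}
```

Reading the size after the delete would of course return nothing useful, which is why the patch fetches the `CarbonFile` once, measures it, then deletes it.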
CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-780379984

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5477/
CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-780388512

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3713/
ajantha-bhat commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577450283

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ##########

@@ -1072,7 +1097,22 @@ public static void deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, boolean
         isUpdateRequired(isForceDeletion, carbonTable, identifier, details, cleanStaleInprogress);
     if (!tuple2.isUpdateRequired) {
-      return;
+      try {
+        for (LoadMetadataDetails oneLoad : details) {
+          if (isExpiredSegment(oneLoad, carbonTable.getAbsoluteTableIdentifier())) {
+            if (!carbonTable.isHivePartitionTable()) {
+              trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath
+                  .getSegmentPath(carbonTable.getTablePath(), oneLoad.getLoadName()));
+            } else {
+              trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                  details, partitionSpecs);
+            }
+          }
+        }
+      } catch (Exception e) {
+        LOG.error("Unable to calculate size of garbage data", e);
+      }
+      return new long[]{sizeFreed, trashSizeRemaining};

Review comment: When nothing is freed by clean files, returning 0 is OK. But when some exception happens in a dry run, it may be better to throw the exception.
ajantha-bhat commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577452054

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ##########

@@ -1125,13 +1165,32 @@ public static void deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, boolean
           CarbonLockUtil.fileUnlock(carbonTableStatusLock, LockUsage.TABLE_STATUS_LOCK);
         }
         if (updateCompletionStatus) {
-          DeleteLoadFolders
+          long[] cleanFileSizeFreed = DeleteLoadFolders
               .physicalFactAndMeasureMetadataDeletion(carbonTable, newAddedLoadHistoryList,
                   isForceDeletion, partitionSpecs, cleanStaleInprogress);
+          sizeFreed += cleanFileSizeFreed[0];
+          trashSizeRemaining += cleanFileSizeFreed[1];
+        }
+      }
+    } else {
+      try {
+        for (LoadMetadataDetails oneLoad : metadataDetails) {
+          if (isExpiredSegment(oneLoad, carbonTable.getAbsoluteTableIdentifier())) {
+            if (!carbonTable.isHivePartitionTable()) {
+              trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(
+                  carbonTable.getTablePath(), oneLoad.getLoadName()));
+            } else {
+              trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                  metadataDetails, partitionSpecs);
+            }
+          }
+        }
+      } catch (Exception e) {
+        LOG.error("Unable to calculate size of garbage data", e);
+      }
    }
  }
+    return new long[]{sizeFreed, trashSizeRemaining};

Review comment: When nothing is freed by clean files, returning 0 is OK. But when some exception happens in a dry run, it may be better to throw the exception.
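The reviewers' point, that a dry run should fail loudly rather than log the error and return zeros, can be sketched outside CarbonData. Names and the flat-file layout here are illustrative, not the project's API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class DryRunStats {

    // Sums the sizes of segments that clean files would delete (sizeFreed)
    // and of expired data that would stay in the trash (trashSizeRemaining).
    // An I/O failure is propagated instead of being caught and logged, so a
    // dry run never returns a silently incomplete estimate.
    static long[] dryRunSizes(List<Path> deletable, List<Path> trash) throws IOException {
        long sizeFreed = 0L;
        long trashSizeRemaining = 0L;
        for (Path p : deletable) {
            sizeFreed += Files.size(p); // throws if the file is missing or unreadable
        }
        for (Path p : trash) {
            trashSizeRemaining += Files.size(p);
        }
        return new long[]{sizeFreed, trashSizeRemaining};
    }

    public static void main(String[] args) throws IOException {
        Path seg = Files.createTempFile("expired_segment", ".carbondata");
        Files.write(seg, new byte[100]);
        Path trashed = Files.createTempFile("trash_segment", ".carbondata");
        Files.write(trashed, new byte[50]);
        long[] stats = dryRunSizes(List.of(seg), List.of(trashed));
        System.out.println(stats[0] + " " + stats[1]); // prints 100 50
    }
}
```

The design trade-off: log-and-continue keeps a real clean-up resilient, but for a dry run the whole output is the estimate, so an exception is the more honest result.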
ajantha-bhat commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577458113

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ##########

@@ -1297,4 +1359,37 @@ public static TableStatusReturnTuple separateVisibleAndInvisibleSegments(
       return new HashMap<>(0);
     }
   }
+
+  public static long partitionTableSegmentSize(CarbonTable carbonTable, LoadMetadataDetails

Review comment: I am thinking that now all clean files operations will become slow because of this size calculation code, which needs to interact with the file system. By default we can have this size calculation, but if the user wants clean files to be faster, can we have some option such as `summary = false`, which won't do any new size calculation and will clean the files faster? @akashrn5, @QiangCai what do you think?
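The proposed `summary = false` option, skipping the size walk when the user only wants a fast clean-up, might look like the following generic sketch. The option name comes from the comment above; the method names and local-filesystem layout are invented for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanWithOptionalSummary {

    // Recursive directory size; this walk is the expensive part on remote
    // file systems such as HDFS or S3.
    static long directorySize(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile).mapToLong(p -> {
                try {
                    return Files.size(p);
                } catch (IOException e) {
                    return 0L;
                }
            }).sum();
        }
    }

    // Deletes the segment directory; computes the freed size only when the
    // caller asked for a summary. Returns -1 when the size was not computed.
    static long cleanSegmentDir(Path segmentDir, boolean summary) throws IOException {
        long sizeFreed = summary ? directorySize(segmentDir) : -1L;
        try (Stream<Path> files = Files.walk(segmentDir)) {
            files.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p); // children first, then the directory itself
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return sizeFreed;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("segment_0");
        Files.write(dir.resolve("part-0.carbondata"), new byte[4096]);
        System.out.println(cleanSegmentDir(dir, true)); // prints 4096
        // cleanSegmentDir(otherDir, false) would skip the walk and return -1
    }
}
```

With this shape the default can stay `summary = true` for the informative output, while users who care about speed opt out of the extra file-system round trips.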
ajantha-bhat commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577500588

########## File path: integration/spark/src/main/scala/org/apache/carbondata/events/CleanFilesEvents.scala ##########

@@ -26,5 +26,6 @@ case class CleanFilesPreEvent(carbonTable: CarbonTable, sparkSession: SparkSessi
 case class CleanFilesPostEvent(
     carbonTable: CarbonTable,
     sparkSession: SparkSession,
-    options: Map[String, String])
+    options: Map[String, String],
+    dryRun: Boolean)

Review comment: Why not send this in the options itself?

########## File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala ##########

@@ -112,13 +141,91 @@
       carbonTable: CarbonTable,
       isForceDelete: Boolean,
       cleanStaleInProgress: Boolean,
-      partitionSpecsOption: Option[Seq[PartitionSpec]]): Unit = {
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
     val partitionSpecs = partitionSpecsOption.map(_.asJava).orNull
-    SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
+    val sizeStatistics = SegmentStatusManager.deleteLoadsAndUpdateMetadata(carbonTable,
       isForceDelete, partitionSpecs, cleanStaleInProgress, true)
     if (carbonTable.isHivePartitionTable && partitionSpecsOption.isDefined) {
       SegmentFileStore.cleanSegments(carbonTable, partitionSpecs, isForceDelete)
     }
+    sizeStatistics
+  }
+
+  private def dryRunOnExpiredSegments(
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Seq[Long] = {
+    var sizeFreed: Long = 0
+    var trashSizeRemaining: Long = 0
+    val loadMetadataDetails = SegmentStatusManager.readLoadMetadata(carbonTable.getMetadataPath)
+    if (SegmentStatusManager.isLoadDeletionRequired(loadMetadataDetails)) {
+      loadMetadataDetails.foreach { oneLoad =>
+        val segmentFilePath = CarbonTablePath.getSegmentFilePath(carbonTable.getTablePath,
+          oneLoad.getSegmentFile)
+        if (DeleteLoadFolders.canDeleteThisLoad(oneLoad, isForceDelete, cleanStaleInProgress)) {
+          // No need to consider physical data for external segments, only consider metadata.
+          if (oneLoad.getPath() == null || oneLoad.getPath().equalsIgnoreCase("NA")) {
+            if (!carbonTable.isHivePartitionTable) {
+              sizeFreed += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(carbonTable
+                .getTablePath, oneLoad.getLoadName))
+            } else {
+              sizeFreed += partitionTableSegmentSize(carbonTable, oneLoad, loadMetadataDetails,
+                partitionSpecsOption)
+            }
+          }
+          sizeFreed += FileFactory.getCarbonFile(segmentFilePath).getSize
+        } else {
+          if (SegmentStatusManager.isExpiredSegment(oneLoad, carbonTable
+            .getAbsoluteTableIdentifier)) {
+            if (!carbonTable.isHivePartitionTable) {
+              trashSizeRemaining += FileFactory.getDirectorySize(CarbonTablePath.getSegmentPath(
+                carbonTable.getTablePath, oneLoad.getLoadName))
+            } else {
+              trashSizeRemaining += partitionTableSegmentSize(carbonTable, oneLoad,
+                loadMetadataDetails, partitionSpecsOption)
+            }
+            trashSizeRemaining += FileFactory.getCarbonFile(segmentFilePath).getSize
+          }
+        }
+      }
+    }
+    Seq(sizeFreed, trashSizeRemaining)
+  }
+
+  def partitionTableSegmentSize(carbonTable: CarbonTable, oneLoad: LoadMetadataDetails,
+      loadMetadataDetails: Array[LoadMetadataDetails],
+      partitionSpecsOption: Option[Seq[PartitionSpec]]): Long = {
+    var segmentSize: Long = 0

Review comment: I see that the size calculation code is duplicated in the dry-run flow and in the clean-up flow; can we extract a common method and use it?

########## File path: integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/cleanfiles/TestCleanFilesCommandPartitionTable.scala ##########

@@ -186,7 +193,9 @@
     removeSegmentEntryFromTableStatusFile(CarbonEnv.getCarbonTable(Some("default"),
       "cleantest")(sqlContext.sparkSession), "2")

-    sql(s"CLEAN FILES FOR TABLE CLEANTEST").show()
+    val df1 = sql(s"CLEAN FILES FOR TABLE CLEANTEST DRYRUN")

Review comment: Please verify (at least manually) that with dry run and without dry run it shows the correct size estimation (if somewhere we are adding from two places, it might give incorrect results). Also verify that the summary matches the actual size cleaned up with a big table in the backend, for both partition and non-partition tables.

########## File path: integration/spark/src/main/scala/org/apache/spark/sql/parser/CarbonSpark2SqlParser.scala ##########

@@ -513,12 +513,13 @@ class CarbonSpark2SqlParser extends CarbonDDLSqlParser {
   protected lazy val cleanFiles: Parser[LogicalPlan] =
     CLEAN ~> FILES ~> FOR ~> TABLE ~> (ident <~ ".").? ~ ident ~
-      (OPTIONS ~> "(" ~> repsep(options, ",") <~ ")").? <~ opt(";") ^^ {
-        case databaseName ~ tableName ~ optionList =>
+      (OPTIONS ~> "(" ~> repsep(options, ",") <~ ")").? ~ opt(DRYRUN) <~ opt(";") ^^ {

Review comment: The dry run option could have been added in the current clean files parser itself (similar to the force delete option); why add a new syntax at the end?
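The extract-a-common-method suggestion amounts to one helper that both the dry-run flow and the actual clean-up flow call to measure a segment, so the estimate a dry run reports and the size a real run frees cannot drift apart. A minimal generic sketch, where the two-part layout (a data directory plus a small segment metadata file) is illustrative rather than CarbonData's exact layout:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SegmentSizeCalculator {

    // Single place that knows how to measure a segment's footprint: the data
    // directory plus its metadata file. Both the dry-run path and the
    // clean-up path should call this, never re-implement it.
    static long segmentSize(Path segmentDir, Path segmentMetaFile) throws IOException {
        long size = 0L;
        if (Files.isDirectory(segmentDir)) {
            try (Stream<Path> files = Files.walk(segmentDir)) {
                size += files.filter(Files::isRegularFile).mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        return 0L;
                    }
                }).sum();
            }
        }
        if (Files.exists(segmentMetaFile)) {
            size += Files.size(segmentMetaFile); // the segment's metadata file
        }
        return size;
    }

    public static void main(String[] args) throws IOException {
        Path segDir = Files.createTempDirectory("Segment_0");
        Files.write(segDir.resolve("part-0.carbondata"), new byte[30]);
        Path meta = Files.createTempFile("segment_0", ".segment");
        Files.write(meta, new byte[12]);
        System.out.println(segmentSize(segDir, meta)); // prints 42
    }
}
```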
akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577523404

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ##########

@@ -1072,7 +1097,22 @@ public static void deleteLoadsAndUpdateMetadata(CarbonTable carbonTable, boolean
     if (!tuple2.isUpdateRequired) {
-      return;
+      ...
+      return new long[]{sizeFreed, trashSizeRemaining};

Review comment: Yes, agree with @ajantha-bhat, I meant the same.
akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577527003

########## File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentStatusManager.java ##########

@@ -1297,4 +1359,37 @@ public static TableStatusReturnTuple separateVisibleAndInvisibleSegments(
       return new HashMap<>(0);
     }
   }
+
+  public static long partitionTableSegmentSize(CarbonTable carbonTable, LoadMetadataDetails

Review comment: Yes, better not to mix the logic of dry-run size calculation and the actual clean files; keep them separate, so that the user knows for sure that running the dry run might take some time, as it will calculate the sizes.
akashrn5 commented on a change in pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#discussion_r577533482

########## File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala ##########

@@ -37,11 +38,24 @@ case class CarbonCleanFilesCommand(
     databaseNameOp: Option[String],
     tableName: String,
     options: Map[String, String] = Map.empty,
+    dryRun: Boolean,
     isInternalCleanCall: Boolean = false)
   extends DataCommand {

   val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)

+  override def output: Seq[AttributeReference] = {
+    if (dryRun) {
+      Seq(
+        AttributeReference("Size Freed", LongType, nullable = false)(),
+        AttributeReference("Trash Data Remaining", LongType, nullable = false)())
+    } else {
+      Seq(
+        AttributeReference("Size Freed", LongType, nullable = false)(),
+        AttributeReference("Trash Data Remaining", LongType, nullable = false)())
+    }

Review comment: The if and else blocks are the same? I think it is better to give these rows only in the case of dry run.

########## File path: integration/spark/src/main/scala/org/apache/carbondata/trash/DataTrashManager.scala ##########

@@ -87,13 +101,28 @@
     }
   }

+  def cleanFilesDryRunOperation(
+      carbonTable: CarbonTable,
+      isForceDelete: Boolean,
+      cleanStaleInProgress: Boolean,
+      partitionSpecs: Option[Seq[PartitionSpec]] = None): Seq[Long] = {
+    // get size freed from the trash folder
+    val trashFolderSizeStats = checkAndCleanTrashFolder(carbonTable, isForceDelete, isDryRun = true)
+    // get size that will be deleted (MFD, Compacted, In Progress segments)
+    val expiredSegmentsSizeStats = dryRunOnExpiredSegments(carbonTable, isForceDelete,
+      cleanStaleInProgress, partitionSpecs)
+    Seq(trashFolderSizeStats.head + expiredSegmentsSizeStats.head,
+      trashFolderSizeStats(1) + expiredSegmentsSizeStats(1))
+  }
+
+  private def checkAndCleanTrashFolder(carbonTable: CarbonTable, isForceDelete: Boolean,
+      isDryRun: Boolean): Seq[Long] = {

Review comment: I think we are mixing the dry run option with force delete, making the code complex with combination handling. What I think is: when the user says dry run, it should be clear that we do not take any other options; we just tell the user in return how much and what we are going to clean, that's all; we will not delete or clear any files during a dry run. So it will be easier to code and cleaner, maybe as a new class or a new method in the clean files command class. What do you guys think @ajantha-bhat @QiangCai
CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-780823505

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5489/
CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-780826470

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3726/
CarbonDataQA2 commented on pull request #4072:
URL: https://github.com/apache/carbondata/pull/4072#issuecomment-781208770

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3729/