[GitHub] [carbondata] vikramahuja1001 opened a new pull request #4051: [WIP] Only consider .segment files for stale segments

classic Classic list List threaded Threaded
31 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 opened a new pull request #4051: [WIP] Only consider .segment files for stale segments

GitBox

vikramahuja1001 opened a new pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051


    ### Why is this PR needed?
    While getting stale segment, we do a list files and take all the files, if there is any folder/file other than .segment file, it will lead to further issues while copying data to the trash folder.
   
    ### What changes were proposed in this PR?
   Added a filter, to only consider the files ending with ".segment"
       
    ### Does this PR introduce any user interface change?
    - No
   
   
    ### Is any new testcase added?
    - No
   
   
       
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [WIP] Only consider .segment files for stale segments

GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-742442851


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5137/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [WIP] Only consider .segment files for stale segments

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-742453013


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3375/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on pull request #4051: [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments folder for cleaning stale segments

GitBox
In reply to this post by GitBox

QiangCai commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-742907587


   can you add a test case for fault testing?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on pull request #4051: [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments folder for cleaning stale segments

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745189584


   @QiangCai , i have changed current test cases itself. Please check


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments folder for cleaning stale segments

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745238696


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5166/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [CARBONDATA-4081] In Clean files operation, only consider ".segment" files in the segments folder for cleaning stale segments

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745241197


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3404/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745308252


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5168/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745308552


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3406/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

ajantha-bhat commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745312384


   @vikramahuja1001 : when the issue is found, what other file was present in segment metadata directory ? how it is impacting ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745790657


   @ajantha-bhat , so in the case of partition table, we have a tmp directory in the segment folder. In this case, while we check for stale segments, we are blindly reading all the files in the segment folder and collecting the name of it and then based on that we read the segment files, since it is not a file, thus it cannot be read and would give an exception. In this fix, we just need to filter if the file ends with ".segment"


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

Indhumathi27 commented on a change in pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#discussion_r544064420



##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -157,11 +157,12 @@ public static void deleteExpiredDataFromTrash(String tablePath) {
     // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
     try {
       if (FileFactory.isFileExist(trashPath)) {

Review comment:
       ```suggestion
         CarbonFile trashFile = FileFactory.getCarbonFile(trashPath);
         if (trashFile.exists()) {
           CarbonFile[] timestampFolderList = trashFile.listFiles();
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

Indhumathi27 commented on a change in pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#discussion_r544064935



##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -181,7 +182,7 @@ public static void emptyTrash(String tablePath) {
     // if the trash folder exists delete the contents of the trash folder
     try {
       if (FileFactory.isFileExist(trashPath)) {

Review comment:
       handle same as above comment




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] Indhumathi27 commented on a change in pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

Indhumathi27 commented on a change in pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#discussion_r544069025



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala
##########
@@ -38,26 +40,33 @@ case class CarbonCleanFilesCommand(
     isInternalCleanCall: Boolean = false)
   extends DataCommand {
 
+  val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+
   override def processData(sparkSession: SparkSession): Seq[Row] = {
     Checker.validateTableExists(databaseNameOp, tableName, sparkSession)
     val carbonTable = CarbonEnv.getCarbonTable(databaseNameOp, tableName)(sparkSession)
     setAuditTable(carbonTable)
-    // if insert overwrite in progress, do not allow delete segment
-    if (SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
+    // if insert overwrite in progress and table not a MV, do not allow delete segment
+    if (!carbonTable.isMV && SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
       throw new ConcurrentOperationException(carbonTable, "insert overwrite", "clean file")
     }
     if (!carbonTable.getTableInfo.isTransactionalTable) {
       throw new MalformedCarbonCommandException("Unsupported operation on non transactional table")
     }
 
-    val preEvent = CleanFilesPreEvent(carbonTable, sparkSession)
-    val postEvent = CleanFilesPostEvent(carbonTable, sparkSession, options)
-    withEvents(preEvent, postEvent) {
-      DataTrashManager.cleanGarbageData(
-        carbonTable,
-        options.getOrElse("force", "false").toBoolean,
-        options.getOrElse("stale_inprogress", "false").toBoolean,
-        CarbonFilters.getPartitions(Seq.empty[Expression], sparkSession, carbonTable))
+    // only proceed if not a MV and if insert overwrite not in progress
+    if (!carbonTable.isMV && !SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
+      val preEvent = CleanFilesPreEvent(carbonTable, sparkSession)
+      val postEvent = CleanFilesPostEvent(carbonTable, sparkSession, options)
+      withEvents(preEvent, postEvent) {
+        DataTrashManager.cleanGarbageData(
+          carbonTable,
+          options.getOrElse("force", "false").toBoolean,
+          options.getOrElse("stale_inprogress", "false").toBoolean,
+          CarbonFilters.getPartitions(Seq.empty[Expression], sparkSession, carbonTable))
+      }
+    } else {
+      LOGGER.info(s"Can not do clean files operation for the MV: ${carbonTable.getTableName}")

Review comment:
       Clean files for MV is not supported?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#discussion_r544072761



##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -157,11 +157,12 @@ public static void deleteExpiredDataFromTrash(String tablePath) {
     // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
     try {
       if (FileFactory.isFileExist(trashPath)) {

Review comment:
       I think we should deprecate listfiles and fileExist and other API in file factory that takes path, it has to take carbon file instead of path. This way we can force user to think about reusing the carbonFile Object.
   we can raise a JIRA and assign some new contributors.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#discussion_r544073512



##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -157,11 +157,12 @@ public static void deleteExpiredDataFromTrash(String tablePath) {
     // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
     try {
       if (FileFactory.isFileExist(trashPath)) {

Review comment:
       currently Vikram can handle for his PR by accepting suggested changes by Indhu. But many places we are not resuing the CarbonFile object. that can be handled by my above point.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#discussion_r544074748



##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -157,11 +157,12 @@ public static void deleteExpiredDataFromTrash(String tablePath) {
     // Deleting the timestamp based subdirectories in the trashfolder by the given timestamp.
     try {
       if (FileFactory.isFileExist(trashPath)) {

Review comment:
       https://issues.apache.org/jira/browse/CARBONDATA-4086




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745862799


   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5175/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

CarbonDataQA2 commented on pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#issuecomment-745867314


   Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3413/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] vikramahuja1001 commented on a change in pull request #4051: [CARBONDATA-4081] Fix multiple issues with clean files command

GitBox
In reply to this post by GitBox

vikramahuja1001 commented on a change in pull request #4051:
URL: https://github.com/apache/carbondata/pull/4051#discussion_r544169226



##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonCleanFilesCommand.scala
##########
@@ -38,26 +40,33 @@ case class CarbonCleanFilesCommand(
     isInternalCleanCall: Boolean = false)
   extends DataCommand {
 
+  val LOGGER: Logger = LogServiceFactory.getLogService(this.getClass.getCanonicalName)
+
   override def processData(sparkSession: SparkSession): Seq[Row] = {
     Checker.validateTableExists(databaseNameOp, tableName, sparkSession)
     val carbonTable = CarbonEnv.getCarbonTable(databaseNameOp, tableName)(sparkSession)
     setAuditTable(carbonTable)
-    // if insert overwrite in progress, do not allow delete segment
-    if (SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
+    // if insert overwrite in progress and table not a MV, do not allow delete segment
+    if (!carbonTable.isMV && SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
       throw new ConcurrentOperationException(carbonTable, "insert overwrite", "clean file")
     }
     if (!carbonTable.getTableInfo.isTransactionalTable) {
       throw new MalformedCarbonCommandException("Unsupported operation on non transactional table")
     }
 
-    val preEvent = CleanFilesPreEvent(carbonTable, sparkSession)
-    val postEvent = CleanFilesPostEvent(carbonTable, sparkSession, options)
-    withEvents(preEvent, postEvent) {
-      DataTrashManager.cleanGarbageData(
-        carbonTable,
-        options.getOrElse("force", "false").toBoolean,
-        options.getOrElse("stale_inprogress", "false").toBoolean,
-        CarbonFilters.getPartitions(Seq.empty[Expression], sparkSession, carbonTable))
+    // only proceed if not a MV and if insert overwrite not in progress
+    if (!carbonTable.isMV && !SegmentStatusManager.isOverwriteInProgressInTable(carbonTable)) {
+      val preEvent = CleanFilesPreEvent(carbonTable, sparkSession)
+      val postEvent = CleanFilesPostEvent(carbonTable, sparkSession, options)
+      withEvents(preEvent, postEvent) {
+        DataTrashManager.cleanGarbageData(
+          carbonTable,
+          options.getOrElse("force", "false").toBoolean,
+          options.getOrElse("stale_inprogress", "false").toBoolean,
+          CarbonFilters.getPartitions(Seq.empty[Expression], sparkSession, carbonTable))
+      }
+    } else {
+      LOGGER.info(s"Can not do clean files operation for the MV: ${carbonTable.getTableName}")

Review comment:
       changed code

##########
File path: core/src/main/java/org/apache/carbondata/core/util/TrashUtil.java
##########
@@ -181,7 +182,7 @@ public static void emptyTrash(String tablePath) {
     // if the trash folder exists delete the contents of the trash folder
     try {
       if (FileFactory.isFileExist(trashPath)) {

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


12