[GitHub] [carbondata] marchpure opened a new pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure opened a new pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
marchpure opened a new pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589
 
 
   …ble throw exception "the unexpected 0 segment found"
   
   Why is this PR needed?
   Recover processing when executing insert stage a partition table throw exception "the unexpected 0 segment found"
   The reason is the snapshot content of partition table is wrong.
   
   What changes were proposed in this PR?
   Let's review the recover processing's purpose, when the segment is loaded successfully, but fail to clean the stage files, the stage(data) will be loaded again which leads to repetition of data. Before, we expect to save a snapshot file consists of segment id and stage file list, the stage files failed to clean can be clean again when we found the corresponding segment has been loaded successfully. But the segmentid can't be achieve while loading partition table, in the other words, we can't get the right snapshot for partition table.
   After discussion, we believe that this problem can also be alleviated by adding retry mechanism when deleting stage files. Thus, there are two changes were proposed which is shown below.
   1) Remove recover processing of Insert stage command.
   2) Add Retry to delete stage files.
   
   Does this PR introduce any user interface change?
   No
   
   Is any new testcase added?
   No
   
    ### Why is this PR needed?
   
   
    ### What changes were proposed in this PR?
   
       
    ### Does this PR introduce any user interface change?
    - No
    - Yes. (please explain the change and update document)
   
    ### Is any new testcase added?
    - No
    - Yes
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-576741787
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1719/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-576751854
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-576789069
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1720/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-576952536
 
 
   @jackylk @niuge01 please review

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#discussion_r369389522
 
 

 ##########
 File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonInsertFromStageCommand.scala
 ##########
 @@ -431,8 +318,53 @@ case class CarbonInsertFromStageCommand(
     }.map { future =>
       future.get()
     }
+  }
+
+  /**
+   * Delete stage file and success file with retry
+   */
+  private def deleteStageFilesWithRetry(
+      executorService: ExecutorService,
+      _stageFiles: Array[(CarbonFile, CarbonFile)]): Unit = {
+    val startTime = System.currentTimeMillis()
+    var retryCount = DELETE_STAGEFILES_RETRY_TIMES
+    var stageFiles = _stageFiles
+    while (stageFiles.length > 0 && retryCount > 0) {
+      retryCount -= 1
+      // 1) delete stage files
+      deleteStageFiles(executorService, stageFiles)
+      // 2) ensure stage files are all deleted
+      stageFiles = ensureStageFilesAreAllDeleted(executorService, stageFiles)
 
 Review comment:
   I think this will slow down the performance, can we check it by using the return value of delete operation in `deleteStageFiles`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#discussion_r369389628
 
 

 ##########
 File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonInsertFromStageCommand.scala
 ##########
 @@ -431,8 +318,53 @@ case class CarbonInsertFromStageCommand(
     }.map { future =>
       future.get()
     }
+  }
+
+  /**
+   * Delete stage file and success file with retry
+   */
+  private def deleteStageFilesWithRetry(
+      executorService: ExecutorService,
+      _stageFiles: Array[(CarbonFile, CarbonFile)]): Unit = {
+    val startTime = System.currentTimeMillis()
+    var retryCount = DELETE_STAGEFILES_RETRY_TIMES
+    var stageFiles = _stageFiles
+    while (stageFiles.length > 0 && retryCount > 0) {
+      retryCount -= 1
+      // 1) delete stage files
+      deleteStageFiles(executorService, stageFiles)
+      // 2) ensure stage files are all deleted
+      stageFiles = ensureStageFilesAreAllDeleted(executorService, stageFiles)
+    }
+    LOGGER.info(s"finished to delete stage files, time taken: " +
+      s"${System.currentTimeMillis() - startTime}ms")
+    // if there are still stage files failed to clean, print log.
+    if (stageFiles.length > 0) {
+      LOGGER.warn(s"failed to clean up stage files:" + stageFiles.map(_._1.getName).mkString(","))
+    }
+  }
+
+
+  /**
+   * ensure stage files are all deleted
+   */
+  private def ensureStageFilesAreAllDeleted(
+      executorService: ExecutorService,
+      stageFiles: Array[(CarbonFile, CarbonFile)]): Array[(CarbonFile, CarbonFile)] = {
+    val startTime = System.currentTimeMillis()
+    stageFiles.map { files =>
+      executorService.submit(new Callable[Boolean] {
+        override def call(): Boolean = {
+          files._1.exists() || files._2.exists()
+        }
+      })
+    }.filter { future =>
+      future.get()
+    }
     LOGGER.info(s"finished to delete stage files, time taken: " +
 
 Review comment:
   This log is incorrect now

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-577044381
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-577052755
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1730/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-577056992
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
jackylk commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-577071759
 
 
   add to whitelist

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-577106050
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1733/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on a change in pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure commented on a change in pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#discussion_r369993207
 
 

 ##########
 File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/management/CarbonInsertFromStageCommand.scala
 ##########
 @@ -431,8 +318,53 @@ case class CarbonInsertFromStageCommand(
     }.map { future =>
       future.get()
     }
+  }
+
+  /**
+   * Delete stage file and success file with retry
+   */
+  private def deleteStageFilesWithRetry(
+      executorService: ExecutorService,
+      _stageFiles: Array[(CarbonFile, CarbonFile)]): Unit = {
+    val startTime = System.currentTimeMillis()
+    var retryCount = DELETE_STAGEFILES_RETRY_TIMES
+    var stageFiles = _stageFiles
+    while (stageFiles.length > 0 && retryCount > 0) {
+      retryCount -= 1
+      // 1) delete stage files
+      deleteStageFiles(executorService, stageFiles)
+      // 2) ensure stage files are all deleted
+      stageFiles = ensureStageFilesAreAllDeleted(executorService, stageFiles)
 
 Review comment:
   solved

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
jackylk commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-578377708
 
 
   LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk removed a comment on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
jackylk removed a comment on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-578377708
 
 
   LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
jackylk commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-578385388
 
 
   please rebase, there are conflicts

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-578506563
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589#issuecomment-578509017
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1766/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure closed pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure closed pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure opened a new pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…

GitBox
In reply to this post by GitBox
marchpure opened a new pull request #3589: [CARBONDATA-3667] Insert stage recover processing of the partition ta…
URL: https://github.com/apache/carbondata/pull/3589
 
 
   …ble throw exception "the unexpected 0 segment found"
   
    ### Why is this PR needed?
   Recover processing when executing insert stage a partition table throw exception "the unexpected 0 segment found"
   The reason is the snapshot content of partition table is wrong.
   
    ### What changes were proposed in this PR?
   Let's review the recover processing's purpose, when the segment is loaded successfully, but fail to clean the stage files, the stage(data) will be loaded again which leads to repetition of data. Before, we expect to save a snapshot file consists of segment id and stage file list, the stage files failed to clean can be clean again when we found the corresponding segment has been loaded successfully. But the segmentid can't be achieve while loading partition table, in the other words, we can't get the right snapshot for partition table.
   After discussion, we believe that this problem can also be alleviated by adding retry mechanism when deleting stage files. Thus, there are two changes were proposed which is shown below.
   1) Remove recover processing of Insert stage command.
   2) Add Retry to delete stage files.
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - No
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
12