Pickupolddriver opened a new pull request #3935: URL: https://github.com/apache/carbondata/pull/3935 ### Why is this PR needed? Currently, in data management scenarios(Data Loading, Segments Compaction.etc), there exist some data deletion actions. And these actions are dangerous because they are written in different places and some corner cases will cause data deletion accidentally. ### What changes were proposed in this PR? This PR remove deleting INSERT_IN_PROGRESS and INSERT_OVERWRITE_IN_PROGRESS status segments during loading and compaction flow. ### Does this PR introduce any user interface change? - No ### Is any new testcase added? - No ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693346444 Can one of the admins verify this patch? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
jackylk commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693387784 add to whitelist ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693391021 Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2362/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693391877 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4104/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693793107 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4112/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693794609 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2370/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
Zhangshunyu commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694079287 From this pr we can see some delete actions are removed, as a result there would be some unused files left, any idea about how to handle the left files? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
Pickupolddriver commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694591124 > From this pr we can see some delete actions are removed, as a result there would be some unused files left, any idea about how to handle the left files? The stale data will remain there until the user calls clean files action manually. Since the data will be check by the table status file, the stale data will not do harm to data management. Since data deleted by accident is a critical problem, but the stale data exist is acceptable. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
akashrn5 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694704817 during insert or load, if you dont clean up still fine, but if you remove `deletePartialLoadsInCompaction` directly, it can create problem of load or query failures or wrong data in query or extra data. This is because we dont enter any entry before compaction in table status in order to maintain the segment ID logic of compaction. So its dangerous to remove this blindly. I remember @ajantha-bhat faced similar issues recently during compaction. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694708624 Agree with @Zhangshunyu and @akashrn5 a) When compaction retried as it use the same segment ID, if stale files are not cleaned. It gives duplicate data. So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry. b) please check and move the logic of deletePartialLoadsInCompaction in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files c) Also if the purpose of this PR at deleting accidental data loss. you need to handle `cleanStaleDeltaFiles` in` CarbonUpdateUtil.java` and also identify other places. Just handling in once place will not guarantee that we cannot have data loss. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat edited a comment on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694708624 Agree with @Zhangshunyu and @akashrn5 a) When compaction retries, it uses the same segment ID, if stale files are not cleaned. It gives duplicate data. So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry. b) please check and move the logic of deletePartialLoadsInCompaction in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files c) Also if the purpose of this PR at deleting accidental data loss. you need to handle `cleanStaleDeltaFiles` in` CarbonUpdateUtil.java` and also identify other places. Just handling in once place will not guarantee that we cannot have data loss. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat edited a comment on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694708624 Agree with @Zhangshunyu and @akashrn5 a) When compaction retries, it uses the same segment ID, if stale files are not cleaned. It gives duplicate data. So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry. b) please check and move the logic of `deletePartialLoadsInCompaction` in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files. c) Also if the purpose of this PR is to avoid accidental data loss. you need to handle `cleanStaleDeltaFiles` in` CarbonUpdateUtil.java` and also identify other places. Just handling in once place will not guarantee that we cannot have data loss. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#discussion_r490760218 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala ########## @@ -267,9 +266,8 @@ object CarbonDataRDDFactory { throw new Exception("Exception in compaction " + exception.getMessage) } } finally { - executor.shutdownNow() try { - compactor.deletePartialLoadsInCompaction() Review comment: a) When compaction retries, it uses the same segment ID, if stale files are not cleaned. It gives duplicate data. So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry. b) please check and move the logic of deletePartialLoadsInCompaction in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files. c) Also if the purpose of this PR is to avoid accidental data loss. you need to handle `cleanStaleDeltaFiles` in `CarbonUpdateUtil.java` and also identify other places. Just handling in few place will not guarantee that we cannot have data loss. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-699612006 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2495/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-699612222 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4239/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
Pickupolddriver commented on a change in pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#discussion_r502735735 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala ########## @@ -267,9 +266,8 @@ object CarbonDataRDDFactory { throw new Exception("Exception in compaction " + exception.getMessage) } } finally { - executor.shutdownNow() try { - compactor.deletePartialLoadsInCompaction() Review comment: (a). My PR will dependent on #3934 , if it merged first we don't need to worry about the same segment ID in compaction situations. So it the deletePartialLoadsInCompaction could be deleted. (b). In this PR's design, the stale files would not be deleted automatically. Users need to call cleanFiles function to clean them besides PR #3917 can handle the clean files function. (c). More places of calling claenStaleFiles have been removed. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-706487703 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4352/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
CarbonDataQA1 commented on pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#issuecomment-706488603 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2602/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
In reply to this post by GitBox
QiangCai commented on a change in pull request #3935: URL: https://github.com/apache/carbondata/pull/3935#discussion_r503018114 ########## File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala ########## @@ -267,9 +266,8 @@ object CarbonDataRDDFactory { throw new Exception("Exception in compaction " + exception.getMessage) } } finally { - executor.shutdownNow() try { - compactor.deletePartialLoadsInCompaction() Review comment: a) root cause of stale files: It uses listFiles to collect index files when writing the segment file, so it will add stale index file names into segment file. it is hard to add a unique id, so we change the plan to avoid to listFiles during wirte segment file. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] |
Free forum by Nabble | Edit this page |