Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] [carbondata] Pickupolddriver opened a new pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

Classic

List

52 messages Options

Options

123

GitBox

[GitHub] [carbondata] Pickupolddriver opened a new pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

Pickupolddriver opened a new pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935

### Why is this PR needed?
Currently, in data management scenarios(Data Loading, Segments Compaction.etc), there exist some data deletion actions. And these actions are dangerous because they are written in different places and some corner cases will cause data deletion accidentally.

### What changes were proposed in this PR?
This PR remove deleting INSERT_IN_PROGRESS and INSERT_OVERWRITE_IN_PROGRESS status segments during loading and compaction flow.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- No

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693346444

Can one of the admins verify this patch?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] jackylk commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

jackylk commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693387784

add to whitelist

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693391021

Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2362/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693391877

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4104/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693793107

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4112/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-693794609

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2370/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] Zhangshunyu commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

Zhangshunyu commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694079287

From this pr we can see some delete actions are removed, as a result there would be some unused files left, any idea about how to handle the left files?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] Pickupolddriver commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

Pickupolddriver commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694591124

> From this pr we can see some delete actions are removed, as a result there would be some unused files left, any idea about how to handle the left files?

The stale data will remain there until the user calls clean files action manually. Since the data will be check by the table status file, the stale data will not do harm to data management. Since data deleted by accident is a critical problem, but the stale data exist is acceptable.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] akashrn5 commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

akashrn5 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694704817

during insert or load, if you dont clean up still fine, but if you remove `deletePartialLoadsInCompaction` directly, it can create problem of load or query failures or wrong data in query or extra data. This is because we dont enter any entry before compaction in table status in order to maintain the segment ID logic of compaction. So its dangerous to remove this blindly.

I remember @ajantha-bhat faced similar issues recently during compaction.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

ajantha-bhat commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694708624

Agree with @Zhangshunyu and @akashrn5
a) When compaction retried as it use the same segment ID, if stale files are not cleaned. It gives duplicate data.
So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry.
b) please check and move the logic of deletePartialLoadsInCompaction in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files
c) Also if the purpose of this PR at deleting accidental data loss. you need to handle `cleanStaleDeltaFiles` in` CarbonUpdateUtil.java` and also identify other places. Just handling in once place will not guarantee that we cannot have data loss.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat edited a comment on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

ajantha-bhat edited a comment on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694708624

Agree with @Zhangshunyu and @akashrn5
a) When compaction retries, it uses the same segment ID, if stale files are not cleaned. It gives duplicate data.
So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry.
b) please check and move the logic of deletePartialLoadsInCompaction in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files
c) Also if the purpose of this PR at deleting accidental data loss. you need to handle `cleanStaleDeltaFiles` in` CarbonUpdateUtil.java` and also identify other places. Just handling in once place will not guarantee that we cannot have data loss.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat edited a comment on pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

ajantha-bhat edited a comment on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-694708624

Agree with @Zhangshunyu and @akashrn5
a) When compaction retries, it uses the same segment ID, if stale files are not cleaned. It gives duplicate data.
So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry.
b) please check and move the logic of `deletePartialLoadsInCompaction` in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files.
c) Also if the purpose of this PR is to avoid accidental data loss. you need to handle `cleanStaleDeltaFiles` in` CarbonUpdateUtil.java` and also identify other places. Just handling in once place will not guarantee that we cannot have data loss.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3935: [CARBONDATA-3993] Remove deletePartialLoadData in data loading&compaction process

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#discussion_r490760218

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -267,9 +266,8 @@ object CarbonDataRDDFactory {
throw new Exception("Exception in compaction " + exception.getMessage)
}
} finally {
- executor.shutdownNow()
try {
- compactor.deletePartialLoadsInCompaction()

Review comment:
a) When compaction retries, it uses the same segment ID, if stale files are not cleaned. It gives duplicate data.
So, before this change, we need #3934 to be merged which can use a unique segment id for compaction retry.

b) please check and move the logic of deletePartialLoadsInCompaction in clean files command, instead of permanently removing it. If the clean files don't have this logic, it may not able to clean stale files.

c) Also if the purpose of this PR is to avoid accidental data loss. you need to handle `cleanStaleDeltaFiles` in `CarbonUpdateUtil.java` and also identify other places. Just handling in few place will not guarantee that we cannot have data loss.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove auto data deletion in IUD processs

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-699612006

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2495/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove auto data deletion in IUD processs

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-699612222

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4239/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] Pickupolddriver commented on a change in pull request #3935: [CARBONDATA-3993] Remove auto data deletion in IUD processs

In reply to this post by GitBox

Pickupolddriver commented on a change in pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#discussion_r502735735

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -267,9 +266,8 @@ object CarbonDataRDDFactory {
throw new Exception("Exception in compaction " + exception.getMessage)
}
} finally {
- executor.shutdownNow()
try {
- compactor.deletePartialLoadsInCompaction()

Review comment:
(a). My PR will dependent on #3934 , if it merged first we don't need to worry about the same segment ID in compaction situations. So it the deletePartialLoadsInCompaction could be deleted.

(b). In this PR's design, the stale files would not be deleted automatically. Users need to call cleanFiles function to clean them besides PR #3917 can handle the clean files function.

(c). More places of calling claenStaleFiles have been removed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove auto data deletion in IUD processs

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-706487703

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4352/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3935: [CARBONDATA-3993] Remove auto data deletion in IUD processs

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#issuecomment-706488603

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2602/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] QiangCai commented on a change in pull request #3935: [CARBONDATA-3993] Remove auto data deletion in IUD processs

In reply to this post by GitBox

QiangCai commented on a change in pull request #3935:
URL: https://github.com/apache/carbondata/pull/3935#discussion_r503018114

##########
File path: integration/spark/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala
##########
@@ -267,9 +266,8 @@ object CarbonDataRDDFactory {
throw new Exception("Exception in compaction " + exception.getMessage)
}
} finally {
- executor.shutdownNow()
try {
- compactor.deletePartialLoadsInCompaction()

Review comment:
a) root cause of stale files: It uses listFiles to collect index files when writing the segment file, so it will add stale index file names into segment file.
it is hard to add a unique id, so we change the plan to avoid to listFiles during wirte segment file.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

123