CarbonData File Deletion Hotfix

CarbonData File Deletion Hotfix

haomarch
Background

Currently, data management operations (data loading, segment compaction,
etc.) include several data deletion actions. These actions are dangerous
because they are implemented in different places, and some corner cases
can cause data to be deleted accidentally.

<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image1.png>


Current Data Deletion in Data Loading process

First, an introduction to the deletion steps in the current data loading process:

1. Delete Stale Segments

This method deletes the segments which are not consistent with the table
status.

In the loading flow, this method scans all the segments, adds the
original segments (like Segment_1, where part[1] does not contain ".") to
the staleSegments list, and then deletes the segments in that list.
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image2.png>
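
The stale segment scan can be sketched roughly as follows. This is a minimal Python illustration, not the CarbonData code: the function name, the plain list of folder names and the set of ids read from table status are all hypothetical, and treating "not compatible with table status" as "id missing from table status" is an assumption.

```python
def find_stale_segments(folder_names, table_status_ids):
    """Return folder names that look like original segments but are
    missing from table status (assumed meaning of "stale")."""
    stale = []
    for name in folder_names:
        parts = name.split("_")
        # Only consider folders named "Segment_<id>".
        if len(parts) != 2 or parts[0] != "Segment":
            continue
        segment_id = parts[1]
        # Only ids without "." count as original segments (like Segment_1).
        if "." in segment_id:
            continue
        if segment_id not in table_status_ids:
            stale.append(name)
    return stale
```

The risk discussed in this thread is visible here: any segment that has not yet been recorded in table status would land in the list and be deleted.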


2. Delete Invalid Segments



Delete Invalid Segments has 3 steps:
(1) Delete Expire Lock

This method deletes expired locks (older than 48 hours).
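
A minimal sketch of the expiry check, assuming each lock is identified by a name and a last-modified time in seconds. The dictionary input and the function name are hypothetical; only the 48 hour threshold comes from the description above.

```python
import time

LOCK_EXPIRY_SECONDS = 48 * 60 * 60  # 48 hours, as described above


def expired_locks(lock_mtimes, now=None):
    """Return the names of locks whose age exceeds the expiry threshold.

    lock_mtimes: dict of lock name -> last-modified time (epoch seconds).
    """
    now = time.time() if now is None else now
    return [name for name, mtime in lock_mtimes.items()
            if now - mtime > LOCK_EXPIRY_SECONDS]
```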

(2) Check whether the data needs to be deleted, and move segments to the proper place




In the current design, it scans for and removes segments in 4 statuses
(MARKED_FOR_DELETE, COMPACTED, INSERT_IN_PROGRESS,
INSERT_OVERWRITE_IN_PROGRESS). When this deletion method is reached from
the loading flow, it scans the segments; if a segment meets the
requirement to be deleted and invisibleSegmentCnt >
invisibleSegmentPreserveCnt, it is added to the history file and then
deleted.
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image3.png>
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image4.png>
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image5.png>
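
The check in step (2) can be sketched as below. This is a simplified Python illustration under stated assumptions, not the actual implementation: segments are plain (id, status) tuples, invisibleSegmentCnt is assumed to be the number of segments in one of the four statuses, and the function name is hypothetical.

```python
DELETABLE_STATUSES = {
    "MARKED_FOR_DELETE",
    "COMPACTED",
    "INSERT_IN_PROGRESS",
    "INSERT_OVERWRITE_IN_PROGRESS",
}


def split_for_history(segments, invisible_segment_preserve_cnt):
    """Split segments into (kept, moved_to_history).

    segments: list of (segment_id, status) tuples.
    Nothing is moved unless the invisible segment count exceeds the
    preserve count.
    """
    invisible = [s for s in segments if s[1] in DELETABLE_STATUSES]
    if len(invisible) <= invisible_segment_preserve_cnt:
        return list(segments), []
    kept = [s for s in segments if s[1] not in DELETABLE_STATUSES]
    return kept, invisible
```

Note that in this form an INSERT_IN_PROGRESS segment of a long-running concurrent load is treated the same as a COMPACTED one, which is the corner case the hotfix targets.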











(3) Delete Invalid Data

In the final step, it deletes the data files that were moved to the
history file.







3. Delete temporary files

By default, during loading CarbonData writes to a temp file first and
copies it to the target path at the end of loading. This method deletes
the temp files.
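
The write-to-temp, copy, then clean up pattern looks roughly like this. It is a generic Python sketch using the standard library, not CarbonData's writer; the function name and the "carbon_load_" prefix are made up for illustration.

```python
import os
import shutil
import tempfile


def load_via_temp(data, target_path):
    """Write data to a temp file, copy it to target_path, then delete
    the temp directory. Returns the temp directory path that was used."""
    tmp_dir = tempfile.mkdtemp(prefix="carbon_load_")
    try:
        tmp_file = os.path.join(tmp_dir, os.path.basename(target_path))
        with open(tmp_file, "wb") as f:
            f.write(data)
        # Copy to the final location only after the write has finished.
        shutil.copyfile(tmp_file, target_path)
    finally:
        # The temp files are removed whether or not the copy succeeded.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return tmp_dir
```

This deletion only touches the temp directory created by the load itself, so it is the least risky of the three steps.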




Data Deletion Hotfix in Loading Process

After analysing the deletion actions during the loading process, we are
going to modify the deletion steps in the loading flow to keep data from
being deleted by accident.

There are two steps to fix the problem:

(1) Replace the stale segment cleaning function with the CleanFiles action.

(2) Ignore the segments whose status is INSERT_IN_PROGRESS or
INSERT_OVERWRITE_IN_PROGRESS, because the loading process might take a
long time in a highly concurrent situation. These two kinds of segments
will be left for the CleanFiles command to delete. Besides, there will be
a recycle bin to store deleted files temporarily, so users can find their
deleted segments in the recycle bin.
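
A minimal sketch of the proposed behaviour, assuming segment folders are named "Segment_<id>" under the table path and that the loading flow still handles MARKED_FOR_DELETE and COMPACTED segments. The function name, the dictionary input and the recycle bin layout are hypothetical; the real design may differ.

```python
import os
import shutil

IN_PROGRESS = {"INSERT_IN_PROGRESS", "INSERT_OVERWRITE_IN_PROGRESS"}
STALE = {"MARKED_FOR_DELETE", "COMPACTED"}


def clean_during_load(segments, table_path, recycle_bin):
    """Move stale segment folders into the recycle bin instead of deleting.

    segments: dict of segment id -> status. In-progress segments are
    skipped and left for an explicit CleanFiles run.
    Returns the ids that were moved.
    """
    moved = []
    os.makedirs(recycle_bin, exist_ok=True)
    for seg_id, status in segments.items():
        if status in IN_PROGRESS or status not in STALE:
            continue
        src = os.path.join(table_path, "Segment_" + seg_id)
        if os.path.isdir(src):
            # Moving keeps the data recoverable until the bin is emptied.
            shutil.move(src, os.path.join(recycle_bin, "Segment_" + seg_id))
            moved.append(seg_id)
    return moved
```

Because nothing is removed from disk here, a wrong decision caused by an inaccurate table status can still be undone by moving the folder back.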









--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Re: CarbonData File Deletion Hotfix

BrooksLi
Hello March,

I agree with making a hotfix for data deletion in the loading and
compaction flows, +1.

Deleting INSERT_IN_PROGRESS and INSERT_OVERWRITE_IN_PROGRESS segments is
dangerous, so these two kinds of segments should not be deleted
automatically.

As for segments in MARKED_FOR_DELETE and COMPACTED status, these are
stale segments, but we can keep them in the file system until the
user/admin calls the clean files action manually, since the deletion
depends on the table status being accurate.

So my opinion is to remove all the automatic clean steps in
loading/compaction flow first to protect the data from being deleted
accidentally.


