Posted by haomarch on Sep 15, 2020; 8:28am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/CarbonData-File-Deletion-Hotfix-tp100384.html
Background
Currently, data management scenarios (data loading, segment compaction, etc.) involve a number of data deletion actions. These actions are dangerous because they are implemented in different places, and some corner cases can cause data to be deleted accidentally.
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image1.png>
Current Data Deletion in the Data Loading Process
First, an introduction to the current data loading process:
1. Delete Stale Segments
This method deletes the segments that are not consistent with the table status.
In the loading flow, it scans all the segments, adds the original segments (such as Segment_1, i.e. those whose name does not contain "." in part[1]) to a staleSegments list, and then deletes the segments in that list (a rough sketch of this check follows the image link below).
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image2.png>
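Below is a minimal standalone Java sketch of the stale-segment check described in step 1. The partition path, the findStaleSegments helper and the validSegmentIds set (segment ids assumed to be read from tablestatus) are assumptions made for illustration, not the actual CarbonData code:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Standalone sketch only: paths, helper names and the validSegmentIds set are
// assumptions, not the real CarbonData classes.
public class StaleSegmentSketch {

  // Collects segment folders under the partition path whose names are "original"
  // (the part after "_" contains no ".") and which are not recorded in tablestatus.
  static List<File> findStaleSegments(File partitionPath, Set<String> validSegmentIds) {
    List<File> staleSegments = new ArrayList<>();
    File[] segmentDirs = partitionPath.listFiles();
    if (segmentDirs == null) {
      return staleSegments;
    }
    for (File dir : segmentDirs) {
      String[] parts = dir.getName().split("_");
      // Original segments look like "Segment_1"; compacted ones like "Segment_1.1".
      boolean isOriginal = parts.length == 2 && !parts[1].contains(".");
      if (isOriginal && !validSegmentIds.contains(parts[1])) {
        staleSegments.add(dir);
      }
    }
    return staleSegments;
  }

  public static void main(String[] args) {
    File partitionPath = new File("/tmp/store/db/table/Fact/Part0");
    Set<String> validSegmentIds = Set.of("0", "1", "2");
    for (File stale : findStaleSegments(partitionPath, validSegmentIds)) {
      System.out.println("Stale segment folder to delete: " + stale.getAbsolutePath());
    }
  }
}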
2. Delete Invalid Segments
Deleting invalid segments involves 3 steps:
(1) Delete Expired Locks
This step deletes the locks that have expired (older than 48h).
(2) Check whether the data needs to be deleted, and move segments to the proper place
In the current design, this step scans for and removes segments in 4 statuses (MARK_FOR_DELETE, COMPACTED, INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS). When this deletion method is reached from the loading flow, it scans the segments; if a segment meets the requirements for deletion and invisibleSegmentCnt > invisibleSegmentPreserveCnt, the segment is added to the history file and then deleted (a sketch of this logic follows step (3) below).
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image3.png>
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image4.png>
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image5.png>
(3) Delete Invalid Data
In the final step, the data files of the segments that were moved to the history file are deleted.
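Below is a standalone Java sketch of the step (2) logic. The SegmentDetail type, the status names and the assumption that only the oldest invisible segments beyond the preserve count are moved to the history file are illustrative, not the real CarbonData implementation:

import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Minimal standalone sketch; types and thresholds are assumptions for illustration.
public class InvalidSegmentSketch {

  enum SegmentStatus {
    SUCCESS, MARK_FOR_DELETE, COMPACTED, INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS
  }

  static class SegmentDetail {
    final String segmentId;
    final SegmentStatus status;
    SegmentDetail(String segmentId, SegmentStatus status) {
      this.segmentId = segmentId;
      this.status = status;
    }
  }

  // Statuses that make a segment invisible and therefore a deletion candidate.
  static final Set<SegmentStatus> INVISIBLE_STATUSES = EnumSet.of(
      SegmentStatus.MARK_FOR_DELETE,
      SegmentStatus.COMPACTED,
      SegmentStatus.INSERT_IN_PROGRESS,
      SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS);

  // Returns the invisible segments that should be moved to the history file (and
  // later physically deleted), but only once the number of invisible segments
  // exceeds the preserve count.
  static List<SegmentDetail> segmentsToMoveToHistory(
      List<SegmentDetail> allSegments, int invisibleSegmentPreserveCnt) {
    List<SegmentDetail> invisible = new ArrayList<>();
    for (SegmentDetail detail : allSegments) {
      if (INVISIBLE_STATUSES.contains(detail.status)) {
        invisible.add(detail);
      }
    }
    int invisibleSegmentCnt = invisible.size();
    if (invisibleSegmentCnt <= invisibleSegmentPreserveCnt) {
      return new ArrayList<>(); // not enough invisible segments yet, keep them all
    }
    // Assumption: entries are in tablestatus order (oldest first), so only the
    // oldest ones beyond the preserve count are moved to the history file.
    return invisible.subList(0, invisibleSegmentCnt - invisibleSegmentPreserveCnt);
  }

  public static void main(String[] args) {
    List<SegmentDetail> segments = List.of(
        new SegmentDetail("0", SegmentStatus.COMPACTED),
        new SegmentDetail("1", SegmentStatus.MARK_FOR_DELETE),
        new SegmentDetail("2", SegmentStatus.SUCCESS),
        new SegmentDetail("3", SegmentStatus.INSERT_IN_PROGRESS));
    for (SegmentDetail d : segmentsToMoveToHistory(segments, 2)) {
      System.out.println("Would move segment " + d.segmentId + " to the history file");
    }
  }
}

With invisibleSegmentPreserveCnt = 2 and three invisible segments, only the oldest one (segment 0) is moved to the history file in this example.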
3. Delete Temporary Files
With the default settings, the loading process first writes to temporary files and copies them to the target path at the end of loading. This method deletes those temporary files (a sketch of this flow follows).
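A standalone Java sketch of this temp-file flow, assuming plain local paths; the folder locations and helper names are illustrative only and not the actual CarbonData code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Standalone sketch of the "write to temp, copy to target, delete temp" flow.
public class TempFileSketch {

  // Copies every data file from the temporary folder to the segment path.
  static void copyTempFilesToTarget(Path tempFolder, Path segmentPath) throws IOException {
    Files.createDirectories(segmentPath);
    try (Stream<Path> files = Files.list(tempFolder)) {
      for (Path file : (Iterable<Path>) files::iterator) {
        Files.copy(file, segmentPath.resolve(file.getFileName()),
            StandardCopyOption.REPLACE_EXISTING);
      }
    }
  }

  // Deletes the temporary folder after the copy has finished.
  static void deleteTempFiles(Path tempFolder) throws IOException {
    try (Stream<Path> files = Files.walk(tempFolder)) {
      // Delete children before the folder itself by walking paths in reverse order.
      for (Path p : (Iterable<Path>) files.sorted((a, b) -> b.compareTo(a))::iterator) {
        Files.deleteIfExists(p);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path tempFolder = Files.createDirectories(Path.of("/tmp/carbon_loading_tmp/segment_3"));
    Files.writeString(tempFolder.resolve("part-0.carbondata"), "dummy");
    Path segmentPath = Path.of("/tmp/store/db/table/Fact/Part0/Segment_3");
    copyTempFilesToTarget(tempFolder, segmentPath);
    deleteTempFiles(tempFolder);
  }
}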
Data Deletion Hotfix in the Loading Process
By analysing the deletion actions during the loading process, we are going to make some modifications to the loading-flow deletion to keep data from being deleted by accident.
There are two steps to fix the problem:
(1) Replace the stale-segment cleaning function with the CleanFiles action.
(2) Ignore the segments whose status is INSERT_IN_PROGRESS or INSERT_OVERWRITE_IN_PROGRESS, because the loading process might take a long time in highly concurrent situations. These two kinds of segments are left to be deleted by the CleanFiles command. Besides, there will be a recycle bin to store the deleted files temporarily, so users can find their deleted segments in the recycle bin (a sketch of this behaviour follows).
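A rough Java sketch of the proposed behaviour: skipping in-progress segments in the loading flow and moving segment folders into a recycle bin instead of deleting them. The recycle bin location, the helper names and the status strings are assumptions for illustration only, not the final design:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the proposed hotfix behaviour only.
public class RecycleBinSketch {

  // In-progress segments are skipped by the loading-flow deletion and left to
  // the CleanFiles command.
  static boolean shouldSkipInLoadingFlow(String segmentStatus) {
    return "INSERT_IN_PROGRESS".equals(segmentStatus)
        || "INSERT_OVERWRITE_IN_PROGRESS".equals(segmentStatus);
  }

  // Instead of deleting a segment folder, move it into the table's recycle bin
  // so that users can still find (and restore) it later.
  static void moveToRecycleBin(Path segmentFolder, Path recycleBin) throws IOException {
    Files.createDirectories(recycleBin);
    Path target = recycleBin.resolve(
        segmentFolder.getFileName() + "_" + System.currentTimeMillis());
    Files.move(segmentFolder, target);
  }

  public static void main(String[] args) throws IOException {
    System.out.println(shouldSkipInLoadingFlow("INSERT_IN_PROGRESS"));  // true -> skip
    System.out.println(shouldSkipInLoadingFlow("MARK_FOR_DELETE"));     // false -> handle
    Path segmentFolder = Files.createDirectories(
        Path.of("/tmp/store/db/table/Fact/Part0/Segment_4"));
    moveToRecycleBin(segmentFolder, Path.of("/tmp/store/db/table/.recyclebin"));
  }
}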