Background
Currently, data management scenarios (data loading, segment compaction, etc.) contain several data deletion actions. These actions are dangerous because they are written in different places, and some corner cases can cause data to be deleted accidentally.

<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image1.png>

Current Data Deletion in the Data Loading Process

First, an introduction to the current data loading process.

1. Delete Stale Segments
This method deletes segments that are not consistent with the table status file. In the loading flow, it scans all the segments, adds the original segments (such as Segment_1, where part[1] of the name does not contain a ".") to a staleSegments list, and then deletes every segment in that list. (See sketch 1 at the end of this post.)

<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image2.png>

2. Delete Invalid Segments
There are 3 steps in Delete Invalid Segments:
(1) Delete expired locks. This step deletes locks that have expired (older than 48 hours). (See sketch 2.)
(2) Check whether the data needs to be deleted, and move segments to the proper place. In the current design, it scans for and removes segments in 4 statuses (MARKED_FOR_DELETE, COMPACTED, INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS). When this deletion method is entered from the loading flow, it scans the segments; if a segment meets the requirements to be deleted and invisibleSegmentCnt > invisibleSegmentPreserveCnt, the segment is added to the history file and then deleted. (See sketch 3.)

<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image3.png>
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image4.png>
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t431/image5.png>

(3) Delete invalid data. In the final step, the data files that were moved to the history file are deleted.

3. Delete Temporary Files
With the default settings, the loading process first writes to temporary files and copies them to the target path at the end of loading. This method deletes those temporary files.

Data Deletion Hotfix in the Loading Process

By analysing the deletion actions during the loading process, we are going to modify the deletion in the loading flow to keep data from being deleted by accident. There are two steps to fix the problem:
(1) Replace the stale-segment cleaning function with the CLEAN FILES action.
(2) Ignore segments whose status is INSERT_IN_PROGRESS or INSERT_OVERWRITE_IN_PROGRESS, because the loading process might take a long time in a highly concurrent situation. These two kinds of segments are left to be deleted by the CLEAN FILES command. (See sketch 4.)

Besides, there will be a recycle bin to store the deleted files temporarily, so users can find their deleted segments in the recycle bin.
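Appendix: hedged sketches of the steps above. These are illustrative only, not the actual CarbonData code.

Sketch 1 (Delete Stale Segments). A minimal sketch of the stale-segment scan in step 1, assuming the check compares on-disk segment directories against the load names recorded in table status. The directory layout, the findStaleSegments helper, and the validLoadNames argument are illustrative assumptions; the real implementation works against the tablestatus file.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StaleSegmentScan {

    // Collect "original" segment directories: part[1] of the name has no ".",
    // so compacted segments such as Segment_1.1 are never treated as stale.
    // Hypothetical helper, not CarbonData's actual API.
    static List<File> findStaleSegments(File partDir, Set<String> validLoadNames) {
        List<File> staleSegments = new ArrayList<>();
        File[] segmentDirs = partDir.listFiles();
        if (segmentDirs == null) {
            return staleSegments;
        }
        for (File segmentDir : segmentDirs) {
            String[] parts = segmentDir.getName().split("_"); // e.g. "Segment_1"
            if (parts.length == 2 && !parts[1].contains(".")
                    && !validLoadNames.contains(parts[1])) {
                staleSegments.add(segmentDir); // original segment missing from table status
            }
        }
        return staleSegments;
    }
}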
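Sketch 2 (Delete Expired Locks). Only the 48-hour expiry rule comes from the post; the lock directory layout and the ".lock" suffix are assumptions.

import java.io.File;
import java.util.concurrent.TimeUnit;

public class ExpiredLockCleaner {

    static final long LOCK_EXPIRY_MILLIS = TimeUnit.HOURS.toMillis(48);

    // Delete lock files whose last modification is older than 48 hours.
    static void deleteExpiredLocks(File lockDir) {
        File[] locks = lockDir.listFiles((dir, name) -> name.endsWith(".lock"));
        if (locks == null) {
            return;
        }
        long now = System.currentTimeMillis();
        for (File lock : locks) {
            if (now - lock.lastModified() > LOCK_EXPIRY_MILLIS) {
                lock.delete(); // expired; safe to remove per the current design
            }
        }
    }
}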
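Sketch 3 (current behaviour of step (2)). SegmentStatus and LoadDetail below are simplified stand-ins for CarbonData's SegmentStatus and LoadMetadataDetails. Only the four statuses and the invisibleSegmentCnt > invisibleSegmentPreserveCnt rule come from the post; the oldest-first ordering is an assumption.

import java.util.ArrayList;
import java.util.List;

public class InvisibleSegmentHistory {

    enum SegmentStatus { SUCCESS, MARKED_FOR_DELETE, COMPACTED,
                         INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS }

    // Simplified stand-in for one table-status entry, ordered oldest first.
    static class LoadDetail {
        final String loadName;
        final SegmentStatus status;
        LoadDetail(String loadName, SegmentStatus status) {
            this.loadName = loadName;
            this.status = status;
        }
    }

    static boolean isInvisible(SegmentStatus s) {
        return s == SegmentStatus.MARKED_FOR_DELETE
                || s == SegmentStatus.COMPACTED
                || s == SegmentStatus.INSERT_IN_PROGRESS
                || s == SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS;
    }

    // Returns the load names that get moved to the history file: invisible
    // segments beyond the preserve count. Their data files are deleted
    // afterwards in step (3).
    static List<String> moveToHistory(List<LoadDetail> details,
                                      int invisibleSegmentPreserveCnt) {
        List<LoadDetail> invisible = new ArrayList<>();
        for (LoadDetail d : details) {
            if (isInvisible(d.status)) {
                invisible.add(d);
            }
        }
        List<String> movedToHistory = new ArrayList<>();
        int invisibleSegmentCnt = invisible.size();
        for (int i = 0; i < invisibleSegmentCnt - invisibleSegmentPreserveCnt; i++) {
            movedToHistory.add(invisible.get(i).loadName); // oldest entries first
        }
        return movedToHistory;
    }
}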
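Sketch 4 (hotfix for step (2) of the fix). Same stand-in types as sketch 3; the change is that both IN_PROGRESS statuses are skipped in the loading flow and left for an explicit CLEAN FILES command.

import java.util.ArrayList;
import java.util.List;

public class LoadingFlowHotfixFilter {

    enum SegmentStatus { SUCCESS, MARKED_FOR_DELETE, COMPACTED,
                         INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS }

    static class LoadDetail {
        final String loadName;
        final SegmentStatus status;
        LoadDetail(String loadName, SegmentStatus status) {
            this.loadName = loadName;
            this.status = status;
        }
    }

    // After the hotfix, the loading flow only ever collects MARKED_FOR_DELETE
    // and COMPACTED segments. Both IN_PROGRESS statuses are skipped, because a
    // long-running concurrent load may still be writing those segments.
    static List<LoadDetail> collectDeletableInLoadingFlow(List<LoadDetail> details) {
        List<LoadDetail> deletable = new ArrayList<>();
        for (LoadDetail d : details) {
            switch (d.status) {
                case MARKED_FOR_DELETE:
                case COMPACTED:
                    deletable.add(d);
                    break;
                default:
                    break; // SUCCESS and both IN_PROGRESS statuses are never touched
            }
        }
        return deletable;
    }
}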
Hello March,
I agree with taking a hotfix for data deletion in the loading and compaction flows, +1. Deleting INSERT_IN_PROGRESS and INSERT_OVERWRITE_IN_PROGRESS segments is a dangerous activity, so these two kinds of segments should not be automatically deleted. As for segments in MARKED_FOR_DELETE and COMPACTED status, these are stale segments, but we can keep them in the file system until the user/admin calls the clean files action manually, since the deletion depends on the table status being accurate. So my opinion is to remove all the automatic clean steps in the loading/compaction flow first, to protect the data from being deleted accidentally.
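To make "remove all the automatic clean steps" concrete, here is a hedged sketch assuming cleanup is funnelled through a single entry point; the CleanupTrigger enum and method names are hypothetical, not CarbonData's actual API. The manual action itself would be the user/admin-issued CLEAN FILES FOR TABLE command.

public class ExplicitCleanupGuard {

    // Hypothetical trigger sources for the shared cleanup entry point.
    enum CleanupTrigger { LOADING_FLOW, COMPACTION_FLOW, CLEAN_FILES_COMMAND }

    // Only the explicit CLEAN FILES command may delete stale data.
    static boolean cleanupAllowed(CleanupTrigger trigger) {
        return trigger == CleanupTrigger.CLEAN_FILES_COMMAND;
    }

    static void maybeCleanStaleSegments(CleanupTrigger trigger) {
        if (!cleanupAllowed(trigger)) {
            return; // loading and compaction flows skip every automatic clean step
        }
        // ... perform the stale-segment cleanup only for the explicit command ...
    }
}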