Re: Clean files enhancement

Posted by kumarvishal09 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Clean-files-enhancement-tp100088p100426.html

Hi Vikram,
Whether the data is moved to Trash or kept inside the FACT/Part0/ folder
does not really matter; either way it will be deleted after the configurable
time. Moving it to Trash adds extra IO and time during data loading.
Everything will work fine as long as tablestatus reports the correct status.
Do not delete the data physically in automatic clean files; just clean the
table status, with a proper backup.
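
To make the "clean the table status with proper backup" step concrete, here is
a minimal sketch in Python. It is only an illustration, not CarbonData code: it
assumes the status file is a JSON list of segment entries, and the "status"
key, the status labels and the ".bak"/".tmp" suffixes are all hypothetical.

```python
import json
import shutil
from pathlib import Path

# Hypothetical status labels for entries that may be dropped from the status file.
STALE_STATUSES = {"Marked for Delete", "Compacted"}


def prune_table_status(status_file: Path) -> Path:
    """Back up the status file, then drop stale entries. No segment data is touched."""
    # 1. Keep a copy of the current file so the cleanup can be undone.
    backup = status_file.with_name(status_file.name + ".bak")
    shutil.copy2(status_file, backup)

    # 2. Keep only the entries that are not stale (assumed layout: list of dicts).
    entries = json.loads(status_file.read_text())
    kept = [e for e in entries if e.get("status") not in STALE_STATUSES]

    # 3. Write to a temporary file and swap it in, so readers never see a half-written file.
    tmp = status_file.with_name(status_file.name + ".tmp")
    tmp.write_text(json.dumps(kept))
    tmp.replace(status_file)
    return backup
```

The segment folders stay where they are; only the metadata changes, and the
backup allows the old status to be restored.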

For physical deletion, let the user call the clean command. It should first
run a sanity check: get the count before deletion, move the segment to be
deleted to some other folder [TRASH], and run the count again.
If both counts match, delete the data; otherwise, move the data back
from TRASH. We need to enhance the current clean command to work this way.
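
The flow described above can be sketched roughly as follows, in Python for
brevity. This is an illustration under assumptions, not the actual clean
command: clean_segment and count_rows are hypothetical names, and count_rows
stands in for whatever query returns the table's row count.

```python
import shutil
from pathlib import Path
from typing import Callable


def clean_segment(segment_dir: Path, trash_dir: Path,
                  count_rows: Callable[[], int]) -> bool:
    """Return True if the segment was deleted, False if it was restored."""
    # 1. Count before touching anything.
    before = count_rows()

    # 2. Move the segment into the trash folder instead of deleting it.
    trash_dir.mkdir(parents=True, exist_ok=True)
    parked = trash_dir / segment_dir.name
    shutil.move(str(segment_dir), str(parked))

    # 3. Count again; if the count itself fails, restore the segment first.
    try:
        after = count_rows()
    except Exception:
        shutil.move(str(parked), str(segment_dir))
        raise

    # 4. Same count: the segment was not needed, so delete it for real.
    if before == after:
        shutil.rmtree(parked)
        return True

    # 5. Mismatch: the segment was still contributing data, so move it back.
    shutil.move(str(parked), str(segment_dir))
    return False
```

The sketch assumes the trash folder does not already contain a folder with the
same segment name; a real implementation would need a unique folder per run.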

-Regards
Kumar Vishal



On Tue, Sep 15, 2020 at 8:50 PM David CaiQiang <[hidden email]> wrote:

> 1. Cleaning an in_progressing segment is very dangerous; please remove this
> part from the code. Only after the user explicitly runs the clean files
> command with the option "clean_in_progressing"="true" should we check the
> segment lock and clean the segment.
>
> 2. If the status of a segment is mark_for_delete/compacted, we can delete
> the segment directly without backup.
>
> 3. Remove the code that cleans stale data and partial data from the
> loading/compaction/update/delete features and so on. It is better to use a
> UUID as the segment folder and let cleaning stale data be an optional
> operation. If we don't clean stale data, the table still works fine.
>
> 5. The trash folder can be under the table path, so each table has a
> separate trash folder. If we clean uncertain data, we can use the trash
> folder to store the data, with a separate folder for each transaction.
>
>
>
> -----
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>