Re: Clean files enhancement

Posted by Ajantha Bhat on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Clean-files-enhancement-tp100088p100388.html

Hi Vikram, thanks for proposing this.

a) If the file system is HDFS, *HDFS already supports trash.*
When data is deleted in HDFS, it is moved to trash instead of being
permanently deleted (the trash interval can also be configured with
*fs.trash.interval*).
b) If the file system is object storage such as s3a or OBS, *it supports
bucket versioning*. The user should enable it to be able to go back to a
previous version of the data.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/undelete-objects.html
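
To illustrate point a), here is a minimal sketch of where a deleted path
would end up and how long it is kept. It assumes the delete goes through the
trash-aware path (for example `hdfs dfs -rm` without `-skipTrash`), the
default `/user/<name>` home directory prefix, and a path outside an
encryption zone. The function names and the sample path are made up for
illustration only.

```python
from datetime import datetime, timedelta
import posixpath


def hdfs_trash_path(user, deleted_path):
    """Location a trash-aware delete would move `deleted_path` to.

    Assumes the default home directory prefix (/user) and no encryption zone.
    """
    return posixpath.join("/user", user, ".Trash", "Current",
                          deleted_path.lstrip("/"))


def earliest_purge_time(deleted_at, trash_interval_minutes):
    """Lower bound on when the trashed data can be purged.

    fs.trash.interval is expressed in minutes; 0 (or less) disables trash,
    in which case None is returned. The real purge can happen later, since
    it depends on when the trash checkpoint was created.
    """
    if trash_interval_minutes <= 0:
        return None
    return deleted_at + timedelta(minutes=trash_interval_minutes)


# Hypothetical example: a segment folder deleted by user "carbon",
# with fs.trash.interval set to 1440 minutes (one day).
print(hdfs_trash_path("carbon", "/warehouse/db1/t1/Fact/Part0/Segment_0"))
print(earliest_purge_time(datetime(2020, 9, 10, 16, 20), 1440))
```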

*So basically, this functionality has to be provided by the underlying file
system, not by the CarbonData layer.*
Keeping a trash folder, with several configurations for it and a check on
the age of its contents, can work,
but it makes the system more complex and adds the overhead of maintaining
this functionality.
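
To make that overhead concrete, the aging check alone would look roughly
like the sketch below. It is not CarbonData code: it works on a local
directory, and `purge_expired` and `retention_days` are hypothetical names
standing in for the proposed thread and carbon property.

```python
import os
import time


def purge_expired(trash_dir, retention_days=3.0, now=None):
    """Delete files under `trash_dir` not modified for `retention_days`.

    Returns the sorted list of deleted paths. `now` can be passed in to make
    the check deterministic; it defaults to the current time.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    deleted = []
    for root, _dirs, files in os.walk(trash_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                deleted.append(path)
    return sorted(deleted)
```

On top of this, something has to schedule the check periodically, and it has
to behave correctly when several drivers run it at the same time.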

Based on this,
*-1 from my side for this feature*. You can wait for other people's
opinions on this before concluding.

Thanks,
Ajantha



On Thu, Sep 10, 2020 at 4:20 PM vikramahuja1001 <[hidden email]>
wrote:

> Hi all,
> This mail is regarding enhancing the clean files command.
> Current behaviour: currently, when clean files is called, the segments
> which are MARKED_FOR_DELETE or COMPACTED are deleted, and their entries are
> removed from the tablestatus file, the Fact folder and the
> metadata/segments folder.
>
> Enhancement behaviour idea: the idea is to create a trash folder (like a
> Recycle Bin, with 777 permissions) which can be stored in the /tmp folder
> (or a user-defined folder; a new property will be exposed). Whenever a
> segment is cleaned, the necessary carbondata files (no other files) can be
> copied to this folder. The RecycleBin folder can have a folder for each
> table, with a name like DBName_TableName. We can keep the carbondata files
> here for 3 days (or as long as the user wants; a carbon property will be
> exposed for this). They can be deleted if they have not been modified for
> 3 days, or for the period set by the property. We can maintain a thread
> which checks the aging time and deletes the necessary carbondata files
> from the trash folder.
>
> Apart from that, INSERT_IN_PROGRESS segments will be cleaned too, but the
> code will first try to acquire the segment lock before cleaning them. If
> the code is able to acquire the segment lock, i.e. it is a stale folder,
> the segment can be cleaned. If the code is not able to acquire the segment
> lock, that means a load or some other operation is in progress, and in
> that case the INSERT_IN_PROGRESS segment will not be cleaned.
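
For reference, the lock check described in the quoted paragraph above could
be sketched as follows. This is only an illustration under assumptions: a
local lock file stands in for CarbonData's segment lock, `try_lock` and
`clean_if_stale` are hypothetical names, and the sketch does not model what
happens when a writer dies without releasing its lock.

```python
import os


def try_lock(lock_path):
    """Try to create the lock file atomically; True if the lock was acquired."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # somebody else holds the lock
    os.close(fd)
    return True


def clean_if_stale(lock_path, clean):
    """Run `clean()` only if the segment lock can be acquired.

    Returns True if the segment was cleaned, False if the lock was held
    (a load or other operation is assumed to be in progress).
    """
    if not try_lock(lock_path):
        return False
    try:
        clean()  # delete the INSERT_IN_PROGRESS segment while holding the lock
        return True
    finally:
        os.remove(lock_path)  # release the lock
```

The cleanup runs while the lock is held, so a load cannot start on the same
segment between the check and the delete.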
>
> Please provide input and suggestions for this enhancement idea.
>
> Thanks
> Vikram Ahuja