Re: [discuss]CarbonData update operation enhance

Posted by Liang Chen on Sep 22, 2020; 12:20pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/discuss-CarbonData-update-operation-enhance-tp99861p100804.html

Hi

Thank you for starting this discussion.
This proposal is for improving data update performance, right?

Regards
Liang


Linwood wrote

> *[Background]*
> The update operation cleans up delta files before each update (see
> cleanUpDeltaFiles(carbonTable, false)). It traverses the metadata path
> and the segment path many times. When there are too many files, the
> overhead grows and the update takes longer.
>
> *[Motivation & Goal]*
> During the update process, reduce the repeated traversal, or move
> cleanUpDeltaFiles to another method.
>
> *[Modification]*
> Some possible solutions follow.
>
> Solution 1:
>
> cleanUpDeltaFiles makes two nearly identical file-listing calls:
> updateStatusManager.getUpdateDeltaFilesList(segment,
> false, CarbonCommonConstants.UPDATE_DELTA_FILE_EXT, true,
> allSegmentFiles, true) and
> updateStatusManager.getUpdateDeltaFilesList(segment,
> false, CarbonCommonConstants.UPDATE_INDEX_FILE_EXT, true,
> allSegmentFiles, true). They differ only in file type, yet the segment
> path is traversed twice. We can merge them into a single traversal.
>
> Solution 2:
>
> Building on solution 1, use Spark or MapReduce to hand the tasks over to other nodes.
>
> Solution 3:
>
> Submit cleanUpDeltaFiles as a separate task and run it in the early
> morning or when the cluster is not busy.
>
> Solution 4:
>
> Establish a garbage collection bin that provides interfaces for our
> program to determine when files enter the bin and how to deal with them.
>
> Please vote on these solutions.
>
> Best Regards,
> LinWood
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
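
To make Solution 1 in the quoted proposal concrete, here is a minimal sketch of listing a segment directory once and grouping the files by extension, instead of listing it once per file type. This is not CarbonData code: the class SinglePassDeltaListing and the method groupByExtension are hypothetical, and it uses local java.nio file access as a stand-in for whatever file system API the real code goes through.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of CarbonData.
final class SinglePassDeltaListing {

    // Lists segmentDir exactly once and buckets each file under the first
    // extension it ends with. Files matching no extension are skipped.
    static Map<String, List<Path>> groupByExtension(Path segmentDir, Set<String> extensions)
            throws IOException {
        Map<String, List<Path>> grouped = new HashMap<>();
        for (String ext : extensions) {
            grouped.put(ext, new ArrayList<>());
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(segmentDir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                for (String ext : extensions) {
                    if (name.endsWith(ext)) {
                        grouped.get(ext).add(file);
                        break;
                    }
                }
            }
        }
        return grouped;
    }
}
```

A caller would pass both extensions in one call (the values behind CarbonCommonConstants.UPDATE_DELTA_FILE_EXT and CarbonCommonConstants.UPDATE_INDEX_FILE_EXT) and read each list from the returned map, so the directory listing cost is paid once per segment rather than once per file type.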

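For Solution 4, the "garbage collection bin" could expose two operations: one that decides when a file enters the bin, and one that decides how binned files are dealt with. The sketch below is only one possible shape, under stated assumptions: the names GarbageBin, DirectoryGarbageBin, discard and purgeOlderThan are hypothetical, the bin is a plain local directory, and file age is judged by last-modified time.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for illustration only; not part of CarbonData.
interface GarbageBin {
    // Moves a file into the bin instead of deleting it right away.
    void discard(Path file) throws IOException;

    // Deletes binned files last modified before cutoffMillis; returns the count.
    int purgeOlderThan(long cutoffMillis) throws IOException;
}

// Assumed implementation backed by a local directory.
final class DirectoryGarbageBin implements GarbageBin {
    private final Path binDir;

    DirectoryGarbageBin(Path binDir) throws IOException {
        this.binDir = Files.createDirectories(binDir);
    }

    @Override
    public void discard(Path file) throws IOException {
        // Prefix the name so files with equal names from different segments do not collide.
        Path target = binDir.resolve(System.nanoTime() + "_" + file.getFileName());
        Files.move(file, target);
    }

    @Override
    public int purgeOlderThan(long cutoffMillis) throws IOException {
        // Collect candidates first, then delete, so the listing is not mutated mid-iteration.
        List<Path> expired = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(binDir)) {
            for (Path entry : entries) {
                if (Files.getLastModifiedTime(entry).toMillis() < cutoffMillis) {
                    expired.add(entry);
                }
            }
        }
        for (Path entry : expired) {
            Files.delete(entry);
        }
        return expired.size();
    }
}
```

With this split, the update path only pays for a cheap move, and the purge can run from a scheduled job during off-peak hours, which also covers the intent of Solution 3.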



