TangLin created CARBONDATA-3976:
-----------------------------------
Summary: CarbonData Update operation enhancement
Key: CARBONDATA-3976
URL: https://issues.apache.org/jira/browse/CARBONDATA-3976
Project: CarbonData
Issue Type: Improvement
Components: data-load
Reporter: TangLin
*Background*
The update operation cleans up delta files before it runs (see
cleanUpDeltaFiles(carbonTable, false)). This cleanup traverses the metadata
path and the segment path many times. When there are many files, the
traversal overhead grows and the update takes longer.
*Motivation & Goal*
During the update process, reduce the number of traversals, or move
cleanUpDeltaFiles out of the update path.
*Modification*
There are several possible solutions:
Solution 1:
In cleanUpDeltaFiles, some of the file-listing calls are nearly identical,
for example
updateStatusManager.getUpdateDeltaFilesList(segment, false,
CarbonCommonConstants.UPDATE_DELTA_FILE_EXT, true, allSegmentFiles, true)
and
updateStatusManager.getUpdateDeltaFilesList(segment, false,
CarbonCommonConstants.UPDATE_INDEX_FILE_EXT, true, allSegmentFiles, true).
They differ only in the file type, yet each call traverses the segment path,
so the path is traversed twice. We can merge them into a single traversal.
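To illustrate the idea behind Solution 1, here is a minimal sketch of a single pass that sorts file names into groups by extension. It is not the CarbonData implementation: the class SinglePassDeltaScan and the method groupByExtension are hypothetical names, and the extensions are passed in by the caller instead of being read from CarbonCommonConstants.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CarbonData.
class SinglePassDeltaScan {

    // Walks the file list once and groups each name under the first
    // extension it ends with. Names matching no extension are skipped.
    static Map<String, List<String>> groupByExtension(
            List<String> fileNames, List<String> extensions) {
        Map<String, List<String>> grouped = new HashMap<>();
        for (String ext : extensions) {
            grouped.put(ext, new ArrayList<>());
        }
        for (String name : fileNames) {
            for (String ext : extensions) {
                if (name.endsWith(ext)) {
                    grouped.get(ext).add(name);
                    break;
                }
            }
        }
        return grouped;
    }
}
```

With this shape, the caller lists the segment path once and then reads both the delta files and the index files from the returned map.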
Solution 2:
Building on Solution 1, use Spark or MapReduce to distribute the cleanup
tasks to other nodes.
Solution 3:
Submit cleanUpDeltaFiles as a separate task and run it in the early morning
or when the cluster is not busy.
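One possible shape for Solution 3, assuming a plain JDK scheduler rather than any existing CarbonData facility: compute the delay until the next off-peak start time, then run the cleanup once a day from there. OffPeakCleanupScheduler, delayUntil and scheduleDaily are hypothetical names, and the cleanup itself is passed in as a Runnable.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler sketch, not part of CarbonData.
class OffPeakCleanupScheduler {

    // Time from 'now' until the next occurrence of 'windowStart'.
    // If the start time has already passed today, wrap to tomorrow.
    static Duration delayUntil(LocalTime now, LocalTime windowStart) {
        Duration delay = Duration.between(now, windowStart);
        return delay.isNegative() ? delay.plusDays(1) : delay;
    }

    // Runs 'cleanup' at the next 'windowStart' and every 24 hours after.
    static ScheduledFuture<?> scheduleDaily(
            ScheduledExecutorService executor, Runnable cleanup,
            LocalTime now, LocalTime windowStart) {
        long initialDelayMs = delayUntil(now, windowStart).toMillis();
        long periodMs = TimeUnit.DAYS.toMillis(1);
        return executor.scheduleAtFixedRate(
            cleanup, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
    }
}
```

A fixed start time is the simplest trigger. Detecting that the cluster is not busy would need a load signal from the cluster, which this sketch does not attempt.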
Solution 4:
Establish a garbage collection bin that provides interfaces for our program
to determine when files enter the bin and how they are handled afterwards.
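As a rough sketch of what the interface in Solution 4 could look like, the class below keeps an in-memory record of files marked as garbage and hands back those that have exceeded a retention period. Everything here is an assumption for illustration: DeltaFileTrashBin, markAsGarbage and purgeExpired are hypothetical names, timestamps are supplied by the caller, and the actual file deletion is left to whoever consumes the returned list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical garbage collection bin, not part of CarbonData.
class DeltaFileTrashBin {
    private final long retentionMillis;
    // File path -> time (epoch millis) at which it entered the bin.
    private final Map<String, Long> trashedAt = new LinkedHashMap<>();

    DeltaFileTrashBin(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Cheap step for the update path: only record the file as garbage.
    // A file already in the bin keeps its original entry time.
    void markAsGarbage(String filePath, long nowMillis) {
        trashedAt.putIfAbsent(filePath, nowMillis);
    }

    // Costly step for a background task: remove from the bin and return
    // every file that has been there for at least the retention period.
    List<String> purgeExpired(long nowMillis) {
        List<String> purged = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = trashedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= retentionMillis) {
                purged.add(entry.getKey());
                it.remove();
            }
        }
        return purged;
    }

    int size() {
        return trashedAt.size();
    }
}
```

The split keeps the update path down to a map insert, while the traversal and deletion cost moves to whichever task calls purgeExpired.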
Please vote on these solutions.
Best Regards,
LinWood
--
This message was sent by Atlassian Jira
(v8.3.4#803005)