Hi all,
Currently CarbonData supports compaction for all sort scopes based on
taskIds, i.e., we group the partitions (carbondata files) of different
segments that have the same taskId into one task and then compact them. But
this is not the correct way to handle compaction in the case of Range Sort,
where the data is divided into different ranges for different segments.
Grouping by taskId may combine data from different ranges into one range,
which is not correct.
For example: Seg_0 has 3 ranges (0-100), (100-200), (200-300) and Seg_1 has
2 ranges (50-150) and (250-300). If we combine these based on taskIds, we
will get a wrong grouping after compaction.
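
To make the example concrete, here is a minimal Python sketch of taskId-based
grouping. It is only an illustration, not CarbonData code: the tuple layout,
the function name group_by_task_id, and the assumption that taskIds are
numbered in range order within each segment are all hypothetical.

```python
from collections import defaultdict

# Hypothetical layout: (segment, task_id, range_start, range_end).
# Assumption: task ids follow range order within each segment.
files = [
    ("Seg_0", 0, 0, 100),
    ("Seg_0", 1, 100, 200),
    ("Seg_0", 2, 200, 300),
    ("Seg_1", 0, 50, 150),
    ("Seg_1", 1, 250, 300),
]


def group_by_task_id(files):
    """Return the overall (min, max) span covered by each taskId group."""
    groups = defaultdict(list)
    for _segment, task_id, start, end in files:
        groups[task_id].append((start, end))
    return {
        task_id: (min(s for s, _ in spans), max(e for _, e in spans))
        for task_id, spans in groups.items()
    }


spans = group_by_task_id(files)
print(spans)  # {0: (0, 150), 1: (100, 300), 2: (200, 300)}
```

Under these assumptions the compacted groups cover 0-150, 100-300 and
200-300, so they overlap each other and the output is no longer cleanly
partitioned by range.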
We can solve this problem by merging the overlapping intervals and deriving
new intervals (ranges) from them. We can then assign each task approximately
the same amount of data by dividing the ranges on the basis of their sizes.
After that, each task can continue with the normal data load flow of Range
Column.
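
A rough Python sketch of the two steps is given below, as one possible
reading of the idea rather than the actual design. The function names
merge_ranges and assign_to_tasks, the (start, end, size) tuples, the sizes
themselves, the half-open [start, end) treatment of ranges, and the greedy
"largest range to the least-loaded task" policy are all assumptions made for
illustration.

```python
import heapq


def merge_ranges(ranges):
    """Merge overlapping (start, end, size) ranges.

    Ranges are treated as half-open [start, end), so ranges that only
    touch at a boundary (e.g. 0-200 and 200-300) are not merged.
    """
    merged = []
    for start, end, size in sorted(ranges):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_size = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_size + size)
        else:
            merged.append((start, end, size))
    return merged


def assign_to_tasks(merged, num_tasks):
    """Greedily give the largest ranges to the currently least-loaded task."""
    heap = [(0, task_id, []) for task_id in range(num_tasks)]
    heapq.heapify(heap)
    for start, end, size in sorted(merged, key=lambda r: r[2], reverse=True):
        load, task_id, assigned = heapq.heappop(heap)
        assigned.append((start, end))
        heapq.heappush(heap, (load + size, task_id, assigned))
    return {task_id: (load, assigned) for load, task_id, assigned in heap}


# Ranges from the example above, with made-up sizes (say, in MB).
ranges = [
    (0, 100, 40), (100, 200, 60), (200, 300, 50),  # Seg_0
    (50, 150, 30), (250, 300, 20),                 # Seg_1
]
merged = merge_ranges(ranges)
print(merged)  # [(0, 200, 130), (200, 300, 70)]
tasks = assign_to_tasks(merged, num_tasks=2)
```

With these made-up sizes the two tasks receive 130 and 70 units, so a large
merged range such as 0-200 would probably need to be split further to get a
truly even distribution. How that split is done is left to the design doc.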
Any suggestions from the community will be greatly appreciated. I will
upload the design doc shortly.
Thanks and regards
Manish Nalla