[DISCUSSION] Support Compaction for Range Sort

ManishNalla1994
Hi all,

Currently CarbonData compacts data for all sort scopes based on taskId: the
partitions (carbondata files) of different segments that have the same
taskId are grouped into one task and then compacted. This is not the
correct way to handle compaction for Range Sort, where each segment has its
data divided into its own set of ranges. Grouping by taskId can combine
data from different ranges into one range, which is incorrect.

For example, Seg_0 has 3 ranges (0-100), (100-200), (200-300) and Seg_1 has
2 ranges (50-150) and (250-300). If we combine these based on taskIds, we
will get a wrong grouping after compaction.
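
To make the problem concrete, here is a minimal Python sketch of taskId-based grouping for the example above. It is only an illustration, not CarbonData code: it assumes that a range's position within its segment stands in for its taskId, and the helper names `group_by_task_id` and `span` are hypothetical.

```python
# Hypothetical model: each segment is a list of (low, high) ranges, and a
# range's position in the list stands in for its taskId (an assumption).
seg_0 = [(0, 100), (100, 200), (200, 300)]
seg_1 = [(50, 150), (250, 300)]

def group_by_task_id(segments):
    """Group ranges that share a list position (taskId) across segments."""
    groups = {}
    for ranges in segments:
        for task_id, rng in enumerate(ranges):
            groups.setdefault(task_id, []).append(rng)
    return groups

def span(ranges):
    """Smallest single range covering every range in the group."""
    return (min(lo for lo, _ in ranges), max(hi for _, hi in ranges))

groups = group_by_task_id([seg_0, seg_1])
spans = {task_id: span(rngs) for task_id, rngs in groups.items()}
print(spans)  # {0: (0, 150), 1: (100, 300), 2: (200, 300)}
```

Under these assumptions the compacted outputs cover (0-150), (100-300) and (200-300), so they overlap one another and the result is no longer cleanly divided by range.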

We can solve this problem by merging the overlapping intervals and deriving
new intervals (ranges) from them. We can then assign each task approximately
the same amount of data by dividing the new ranges on the basis of their
sizes. After that, each task can continue with the normal Range Column data
load flow.
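
One possible reading of "dividing on the basis of sizes" is sketched below in Python. It is an assumption, not the design from the JIRA: each merged range is cut into a number of sub-ranges proportional to its share of the total data. The function name `splits_per_range` and the sizes are hypothetical.

```python
def splits_per_range(range_sizes, num_tasks):
    """Return how many sub-ranges each merged range should be cut into.

    range_sizes maps a (low, high) range to its data size; the unit does
    not matter as long as it is the same for every range.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    total = sum(range_sizes.values())
    target = total / num_tasks  # ideal amount of data per task
    # Every range gets at least one split; because of rounding, the counts
    # may not add up to exactly num_tasks.
    return {rng: max(1, round(size / target))
            for rng, size in range_sizes.items()}

# Hypothetical sizes (in MB) for two merged ranges.
sizes = {(0, 100): 100, (100, 350): 300}
plan = splits_per_range(sizes, num_tasks=4)
print(plan)  # {(0, 100): 1, (100, 350): 3}
```

With 400 MB spread over 4 tasks, the target is 100 MB per task, so the smaller range stays whole and the larger one is cut into 3 parts.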

Any suggestions from the community will be greatly appreciated. I will
upload the design doc shortly.

Thanks and regards
Manish Nalla



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Support Compaction for Range Sort

xuchuanyin
Hi ManishNalla:

"""
merging the overlapping intervals and getting new intervals(ranges) out of
them
"""
===
What do you mean by this? Can you give an example?




Re: [DISCUSSION] Support Compaction for Range Sort

ManishNalla1994
Hi Xuchuanyin,

Thanks for the question.

Consider the following example where we have 2 segments:

Seg_0 : R1(0-100), R2(100-200), R3(200-300)
Seg_1 : R4(0-50), R5(150-250), R6(250-350)

The new ranges formed after merging all the overlapping ranges of both
segments will be:

R1(0-100) & R2(100-350).

Here R4 falls inside the old R1, while the old R2, R5, R3 and R6 overlap one
another in a chain and merge into (100-350). These two are the new ranges,
but we can further divide them into smaller ranges based on their sizes so
that they can be distributed across different tasks. The design document
will explain the other cases in more detail.
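
The merge step can be sketched in Python as a standard sort-and-sweep over intervals. This is an illustration only, not the CarbonData implementation. The thread does not say how ranges that merely touch at a boundary are handled; the sketch keeps them apart by default, which reproduces the result above, and the `merge_touching` flag is a hypothetical switch for the other behaviour.

```python
def merge_ranges(ranges, merge_touching=False):
    """Merge overlapping (low, high) ranges into a sorted, disjoint list.

    With merge_touching=False, ranges that only share a boundary value
    (for example 0-100 and 100-200) are kept separate.
    """
    merged = []
    for lo, hi in sorted(ranges):
        if merged:
            last_lo, last_hi = merged[-1]
            overlaps = lo <= last_hi if merge_touching else lo < last_hi
            if overlaps:
                # Extend the previous range instead of starting a new one.
                merged[-1] = (last_lo, max(last_hi, hi))
                continue
        merged.append((lo, hi))
    return merged

seg_0 = [(0, 100), (100, 200), (200, 300)]
seg_1 = [(0, 50), (150, 250), (250, 350)]
new_ranges = merge_ranges(seg_0 + seg_1)
print(new_ranges)  # [(0, 100), (100, 350)]
```

If touching boundaries were merged as well (`merge_touching=True`), the same input would collapse into the single range (0-350).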

Thanks.




Re: [DISCUSSION] Support Compaction for Range Sort

ManishNalla1994
In reply to this post by ManishNalla1994
Hi all,

https://issues.apache.org/jira/browse/CARBONDATA-3343

Here is the JIRA link for this feature. The design doc is attached to the
JIRA.

Thanks and Regards,
Manish Nalla




Re: [DISCUSSION] Support Compaction for Range Sort

David CaiQiang
In reply to this post by ManishNalla1994
How will it compact Seg_0 and Seg_1 in the new compaction?

For example, Seg_0 has 3 ranges (0-100), (100-200), (200-300) and Seg_1 has
2 ranges (50-150) and (250-300).



-----
Best Regards
David Cai

Re: [DISCUSSION] Support Compaction for Range Sort

ManishNalla1994
Hi David,

In the case you gave, the ranges will combine into 1 common range, that is
(0-300). We then check how many tasks (default parallelism) we have to
create and divide the range into that many ranges. Suppose we have 3
executors with 1 core each, so the parallelism is 3. We can then divide the
one range into 3 ranges, (0-100), (100-200) and (200-300), as our final
ranges.
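
The division step for this example can be sketched in Python as follows. This is a simplified illustration: it assumes integer range bounds and an equal-width split, whereas the actual feature may pick boundaries from the data distribution. The function name `split_range` is hypothetical.

```python
def split_range(low, high, parallelism):
    """Cut the range low-high into `parallelism` contiguous ranges of
    (nearly) equal width, using integer boundaries."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    width = high - low
    # parallelism + 1 boundaries, from low to high inclusive.
    bounds = [low + width * i // parallelism for i in range(parallelism + 1)]
    return list(zip(bounds, bounds[1:]))

# 3 executors with 1 core each give a parallelism of 3.
final_ranges = split_range(0, 300, parallelism=3)
print(final_ranges)  # [(0, 100), (100, 200), (200, 300)]
```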

Thanks and regards,
Manish Nalla


