Apache CarbonData Dev Mailing List archive

[DISCUSSION] Support Compaction for Range Sort

ManishNalla1994

[DISCUSSION] Support Compaction for Range Sort

Hi all,

Currently, CarbonData compacts all sort scopes based on taskId: we group the
partitions (carbondata files) of different segments that have the same
taskId into one task and then compact them. This is not the correct way to
handle compaction for Range Sort, where each segment's data is divided into
its own set of ranges. Grouping by taskId may combine data from different
ranges into one range, which may not be correct.

For example, Seg_0 has 3 ranges, (0-100), (100-200) and (200-300), and Seg_1
has 2 ranges, (50-150) and (250-300). If we combine based on taskId here, we
will get a wrong grouping after compaction.
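
To illustrate the wrong grouping, here is a minimal Python sketch. It assumes,
hypothetically, that taskIds are assigned in range order within each segment
(0, 1, 2, ...). The tuple layout and function names are made up for
illustration and are not CarbonData APIs.

```python
from collections import defaultdict

# (segment, task_id, (low, high)) for each carbondata file.
# Assumption: task ids follow the order of the ranges inside a segment.
files = [
    ("Seg_0", 0, (0, 100)),
    ("Seg_0", 1, (100, 200)),
    ("Seg_0", 2, (200, 300)),
    ("Seg_1", 0, (50, 150)),
    ("Seg_1", 1, (250, 300)),
]


def group_by_task_id(files):
    """Group the value ranges of all segments by task id."""
    groups = defaultdict(list)
    for _segment, task_id, value_range in files:
        groups[task_id].append(value_range)
    return dict(groups)


def span(ranges):
    """Smallest interval that covers every range in the group."""
    return (min(low for low, _ in ranges), max(high for _, high in ranges))


groups = group_by_task_id(files)
spans = {task_id: span(ranges) for task_id, ranges in groups.items()}
print(spans)
```

Under this assumption the task 0 group spans (0-150) and the task 1 group
spans (100-300), so the compacted groups overlap each other instead of
forming clean ranges.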

We can solve this problem by merging the overlapping intervals and deriving
new intervals (ranges) from them. We can then assign each task approximately
the same amount of data by dividing on the basis of the sizes of the ranges.
After that, each task continues with the normal data load flow of Range
Column.

Any suggestions from the community will be greatly appreciated. I will
upload the design doc shortly.

Thanks and regards
Manish Nalla

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

xuchuanyin

Re: [DISCUSSION] Support Compaction for Range Sort

Hi ManishNalla:

"""
merging the overlapping intervals and getting new intervals(ranges) out of
them
"""
===
What do you mean by this? Can you give an example?

ManishNalla1994

Re: [DISCUSSION] Support Compaction for Range Sort

Hi Xuchuanyin,

Thanks for the question.

Consider the following example where we have 2 segments:

Seg_0 : R1(0-100), R2(100-200), R3(200-300)
Seg_1 : R4(0-50), R5(150-250), R6(250-350)

The new ranges formed after merging all the overlapping ranges of both
segments will be:

R1(0-100) and R2(100-350).

These two are the new ranges, but we can further divide them into smaller
ranges based on their sizes, to be distributed across different tasks. Other
cases will be explained in the design document.
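
As a rough sketch of the merging step (not the actual implementation), the
Python below merges the ranges of both segments. It assumes each range is
half-open, [low, high), so ranges that only touch at a boundary are not
merged. This assumption is chosen because it reproduces the result above.
The function name merge_overlapping is hypothetical.

```python
def merge_overlapping(ranges):
    """Merge ranges that strictly overlap; ranges are (low, high) tuples.

    Assumption: ranges are half-open, so (0, 100) and (100, 200) only touch
    and stay separate.
    """
    merged = []
    for low, high in sorted(ranges):
        if merged and low < merged[-1][1]:
            # Overlaps the last merged range: extend its upper bound.
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


seg_0 = [(0, 100), (100, 200), (200, 300)]
seg_1 = [(0, 50), (150, 250), (250, 350)]
new_ranges = merge_overlapping(seg_0 + seg_1)
print(new_ranges)  # [(0, 100), (100, 350)]
```

Sorting first means a single pass is enough, since each range only needs to
be compared with the most recently merged one.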

Thanks.

ManishNalla1994

Re: [DISCUSSION] Support Compaction for Range Sort

In reply to this post by ManishNalla1994

Hi all,

https://issues.apache.org/jira/browse/CARBONDATA-3343

Here is the JIRA link for this feature. The design doc is attached to the JIRA.

Thanks and Regards,
Manish Nalla

David CaiQiang

Re: [DISCUSSION] Support Compaction for Range Sort

In reply to this post by ManishNalla1994

How will it compact Seg_0 and Seg_1 in the new compaction?

For example: Seg_0 has 3 ranges (0-100), (100-200), (200-300) and Seg_1 has
2 ranges (50-150) and (250-300).

-----
Best Regards
David Cai

ManishNalla1994

Re: [DISCUSSION] Support Compaction for Range Sort

Hi David,

In the case you gave, the ranges will combine into 1 common range, (0-300).
We then check how many tasks (the default parallelism) we need and divide
the common range into that many ranges. Suppose we have 3 executors with 1
core each, so the parallelism is 3. We can then divide the one range into 3
ranges, (0-100), (100-200) and (200-300), as our final ranges.
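
The division step can be sketched in Python as below. This is a simplified
illustration that splits a range into equal-width parts, which only balances
the tasks if the data is spread evenly over the range. The proposal itself
divides on the basis of data size. The function name split_range is
hypothetical.

```python
def split_range(low, high, parallelism):
    """Split (low, high) into `parallelism` contiguous ranges of equal width.

    Assumption: integer bounds and evenly distributed data.
    """
    width = high - low
    bounds = [low + width * i // parallelism for i in range(parallelism + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# 3 executors with 1 core each, so the parallelism is 3.
final_ranges = split_range(0, 300, 3)
print(final_ranges)  # [(0, 100), (100, 200), (200, 300)]
```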

Thanks and regards,
Manish Nalla
