Hi Community,
CarbonData has an advantage in the metadata it stores, which is already used in many ways to improve query and load performance. I think we should leverage this metadata to improve query performance further, especially join performance.

Assume we have a query joining two tables, t1 and t2, with no filter condition, only the join keys. Both tables would then be scanned fully and joined on the join key; if the left table is very big, this takes a lot of time. So what if we take the min-max of the right table and apply it as a BETWEEN/range filter on the left table? Since we store the min-max of each segment in the segment file, we can use this information to scan less data, which would improve join performance.

I have attached a doc (Query_Join_Pruning.docx) with some examples. Please check it and give your feedback and any other inputs/suggestions to go ahead.

Thanks,
Regards,
Akash R Nilugal
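To illustrate the idea above, here is a minimal Spark SQL sketch. The table and column names (big_table, small_table, join_key with a numeric type) are hypothetical, and the min-max here comes from an aggregate query; in the actual proposal it would come from CarbonData's segment-level min-max metadata instead of a scan.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-min-max-pruning-sketch")
  .getOrCreate()

// Step 1: get the min and max of the join key on the small (right) table.
val bounds = spark
  .sql("SELECT MIN(join_key) AS mn, MAX(join_key) AS mx FROM small_table")
  .head()
val (minKey, maxKey) = (bounds.get(0), bounds.get(1))

// Step 2: apply the min-max as a range filter on the big (left) table, so
// its scan can skip blocks/segments whose join_key lies outside the range.
val pruned = spark.sql(
  s"""SELECT t1.*, t2.*
     |FROM big_table t1
     |JOIN small_table t2 ON t1.join_key = t2.join_key
     |WHERE t1.join_key BETWEEN $minKey AND $maxKey
     |""".stripMargin)

pruned.show()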
Please note the below points in addition to the above:
1. There is a JIRA in Spark similar to what I have raised: https://issues.apache.org/jira/browse/SPARK-27227. They are also aiming at the same thing, but it is still in progress and targeted for Spark 3.1.0. There they plan to first execute a query on the right table to get the min-max, bloom index, etc., and apply it to the left table. The design is still in review; we can go through it and look deeper into it.

2. https://www.qubole.com/blog/enhance-spark-performance-with-dynamic-filtering/ This is also a similar approach, but it is in a private version, so please consider this as well.

With the above information and our segment-info metadata (or maybe information we store in a cache once we scan the small table), we can use that to reduce the scan on the big table; a rough sketch of this caching idea is shown below. Note that we still do not have Spark 3 integration, and dynamic filtering there is still in the design phase.

Please give your inputs; we can discuss further.

Thanks,

Akash R
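A rough sketch of the "scan the small table once, cache it, and reuse its min-max" idea. This assumes an existing SparkSession named `spark` and the same hypothetical tables big_table / small_table with a numeric join_key column; it is only an illustration of the query shape, not of how CarbonData would wire this in internally.

import org.apache.spark.sql.functions.{col, max, min}

// Scan and cache the small table once; its min-max is then cheap to compute
// and can be reused for later queries against the same table.
val smallDf = spark.table("small_table").cache()

val bounds = smallDf
  .agg(min(col("join_key")).as("mn"), max(col("join_key")).as("mx"))
  .head()
val (mn, mx) = (bounds.get(0), bounds.get(1))

// Apply the cached bounds as a range filter on the big table before the join,
// so the big-table scan can skip data outside [mn, mx].
val joined = spark.table("big_table")
  .filter(col("join_key").between(mn, mx))
  .join(smallDf, Seq("join_key"))

joined.show()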
Hi Akash,
*Just my opinion*: once Spark supports it, we can handle it in Carbon if anything more needs to be supported. *Doing this change independently of Spark could make us lose the advantage once Spark brings it in by default.* Qubole's dynamic filtering is already merged in PrestoSQL, and it will be merged in Spark as well since it is beneficial. So maybe we can first support Spark 3.x with Carbon (which will first bring the DPP [dynamic partition pruning] optimization; a small illustration follows below) and handle dynamic filtering when Spark supports it.

Thanks,
Ajantha
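For reference, an illustrative sketch of Spark 3.x dynamic partition pruning (DPP), not CarbonData code. The table and column names (sales, dates, date_id, year) are hypothetical, and `sales` is assumed to be partitioned by date_id.

// DPP is enabled by default in Spark 3.x; set explicitly here for clarity.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

// The filter on the dimension table (d.year = 2020) is turned into a runtime
// subquery filter on the fact table's partition column, so only the matching
// partitions of `sales` are scanned.
val result = spark.sql(
  """SELECT s.*
    |FROM sales s
    |JOIN dates d ON s.date_id = d.date_id
    |WHERE d.year = 2020
    |""".stripMargin)

result.explain()  // the physical plan shows a dynamicpruningexpression on sales.date_id when DPP applies

This is the optimization Spark 3.x brings for free once Carbon integrates with it, whereas the min-max based filtering discussed above would also help non-partitioned join keys.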