Hi Community,
CarbonData has an advantage in the metadata it stores, which is already used in many ways to improve query and load performance. I think we should leverage this metadata to improve query performance further, especially join performance.

Assume we have a query joining two tables, t1 and t2, with no filter condition, only the join keys. Both tables would then be scanned fully and joined on the join key; if the left table is very big, this takes a lot of time. So what if we take the min-max of the right table and apply it as a BETWEEN/range filter on the left table? Since we store the min-max of each segment in the segment file, we can use this information to scan less data, which would improve join performance.

I have attached a doc (Query_Join_Pruning.docx) with some examples. Please check it and give your feedback and any other inputs/suggestions to go ahead.

Thanks,
Regards,
Akash R Nilugal
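To illustrate the idea above, here is a minimal Spark SQL sketch. The table and column names (big_table, small_table, join_key with a numeric type) are hypothetical, and the min-max here comes from an aggregate query; in the actual proposal it would come from CarbonData's segment-level min-max metadata instead of a scan.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-min-max-pruning-sketch")
  .getOrCreate()

// Step 1: get the min and max of the join key on the small (right) table.
val bounds = spark
  .sql("SELECT MIN(join_key) AS mn, MAX(join_key) AS mx FROM small_table")
  .head()
val (minKey, maxKey) = (bounds.get(0), bounds.get(1))

// Step 2: apply the min-max as a range filter on the big (left) table, so
// its scan can skip blocks/segments whose join_key lies outside the range.
val pruned = spark.sql(
  s"""SELECT t1.*, t2.*
     |FROM big_table t1
     |JOIN small_table t2 ON t1.join_key = t2.join_key
     |WHERE t1.join_key BETWEEN $minKey AND $maxKey
     |""".stripMargin)

pruned.show()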
Please note the below points in addition to the above:
1. There is a JIRA in Spark similar to what I have raised: https://issues.apache.org/jira/browse/SPARK-27227. They are also aiming at the same thing, but it is still in progress and targeted for Spark 3.1.0. There they plan to first execute a query on the right table to get the min-max, bloom index, etc., and apply it to the left table. The design is still in review; we can go through it and look deeper into it.

2. https://www.qubole.com/blog/enhance-spark-performance-with-dynamic-filtering/ This is also a similar approach, but it is in a private version, so please consider this as well.

With the above information and our segment-info metadata (or maybe information we store in a cache once we scan the small table), we can use that to reduce the scan on the big table; a rough sketch of this caching idea is shown below. Note that we still do not have Spark 3 integration, and dynamic filtering there is still in the design phase.

Please give your inputs; we can discuss further.

Thanks,

Akash R
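A rough sketch of the "scan the small table once, cache it, and reuse its min-max" idea. This assumes an existing SparkSession named `spark` and the same hypothetical tables big_table / small_table with a numeric join_key column; it is only an illustration of the query shape, not of how CarbonData would wire this in internally.

import org.apache.spark.sql.functions.{col, max, min}

// Scan and cache the small table once; its min-max is then cheap to compute
// and can be reused for later queries against the same table.
val smallDf = spark.table("small_table").cache()

val bounds = smallDf
  .agg(min(col("join_key")).as("mn"), max(col("join_key")).as("mx"))
  .head()
val (mn, mx) = (bounds.get(0), bounds.get(1))

// Apply the cached bounds as a range filter on the big table before the join,
// so the big-table scan can skip data outside [mn, mx].
val joined = spark.table("big_table")
  .filter(col("join_key").between(mn, mx))
  .join(smallDf, Seq("join_key"))

joined.show()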
Hi Akash,
*Just my opinion*: once Spark supports it, we can handle it in Carbon if anything more needs to be supported. *Doing this change independently of Spark could make us lose the advantage once Spark brings it in by default.* Qubole's dynamic filtering is already merged in PrestoSQL, and it will be merged in Spark as well since it is beneficial. So maybe we can first support Spark 3.x with Carbon (which will first bring the DPP [dynamic partition pruning] optimization; a small illustration follows below) and handle dynamic filtering when Spark supports it.

Thanks,
Ajantha
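For reference, an illustrative sketch of Spark 3.x dynamic partition pruning (DPP), not CarbonData code. The table and column names (sales, dates, date_id, year) are hypothetical, and `sales` is assumed to be partitioned by date_id.

// DPP is enabled by default in Spark 3.x; set explicitly here for clarity.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

// The filter on the dimension table (d.year = 2020) is turned into a runtime
// subquery filter on the fact table's partition column, so only the matching
// partitions of `sales` are scanned.
val result = spark.sql(
  """SELECT s.*
    |FROM sales s
    |JOIN dates d ON s.date_id = d.date_id
    |WHERE d.year = 2020
    |""".stripMargin)

result.explain()  // the physical plan shows a dynamicpruningexpression on sales.date_id when DPP applies

This is the optimization Spark 3.x brings for free once Carbon integrates with it, whereas the min-max based filtering discussed above would also help non-partitioned join keys.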