[proposal] Parallelize block pruning of default datamap in driver for filter query processing.

Ajantha Bhat
Hi all,
I want to propose *"Parallelize block pruning of default datamap in driver
for filter query processing"*

*Background:*
We do block pruning for filter queries on the driver side.
In real big data scenarios, one carbon table can have millions of
carbon files.
Currently, block pruning for 1 million carbon files takes around 5 seconds,
as each carbon file takes around 0.005 ms to prune
(with only one filter column set in the 'column_meta_cache' tblproperty).
With more files, block pruning takes even longer.
Also, the Spark job is not launched until block pruning completes, so
the user cannot tell what is happening during that time or why the Spark job
has not launched.
Block pruning is slow today because segments are processed
sequentially. We can reduce the time by parallelizing it.
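
To make the numbers above concrete, here is a small back-of-the-envelope sketch. The class and method names are hypothetical, and the even split across threads with perfect scaling is an optimistic assumption of this sketch, not a measured result. Integer microseconds are used so the arithmetic stays exact (0.005 ms = 5 microseconds per file).

```java
class PruningTimeEstimate {
    // Hypothetical helper: estimated pruning time in microseconds.
    // Assumes files are split evenly and threads scale perfectly (optimistic).
    static long estimateMicros(long files, long perFileMicros, int threads) {
        return files * perFileMicros / threads;
    }

    public static void main(String[] args) {
        // 1 million files at 5 microseconds each, sequentially: 5,000,000 us = 5 s
        System.out.println(estimateMicros(1_000_000L, 5L, 1));
        // Same workload over 4 threads, ideal case: 1,250,000 us = 1.25 s
        System.out.println(estimateMicros(1_000_000L, 5L, 4));
    }
}
```

The 4-thread figure is only a best case; actual gains depend on how evenly the files divide across segments.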


*Solution:*
Keep the default number of threads for block pruning at 4.
The user can reduce this number with the carbon property
"carbon.max.driver.threads.for.pruning", which accepts values from 1 to 4.
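
A minimal sketch of how such a property could be resolved. The property name, the default of 4, and the 1 to 4 range come from the proposal; the class, the helper name, and the choice to fall back to the default on missing or invalid values are assumptions for illustration, not CarbonData's actual code.

```java
import java.util.Properties;

class PruningThreadConfig {
    // Property name, default, and bounds as stated in the proposal.
    static final String KEY = "carbon.max.driver.threads.for.pruning";
    static final int DEFAULT_THREADS = 4;
    static final int MIN_THREADS = 1;
    static final int MAX_THREADS = 4;

    // Hypothetical helper: returns the configured thread count, falling back
    // to the default when the value is missing, non-numeric, or out of range.
    static int resolveThreads(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_THREADS;
        }
        try {
            int n = Integer.parseInt(raw.trim());
            return (n >= MIN_THREADS && n <= MAX_THREADS) ? n : DEFAULT_THREADS;
        } catch (NumberFormatException e) {
            return DEFAULT_THREADS;
        }
    }
}
```

Falling back to the default rather than failing keeps a bad setting from blocking queries; rejecting the value outright would be an equally valid choice.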

In TableDataMap.prune():

Group the segments by thread, distributing an equal number of carbon files
to each thread.
Launch the threads, each handling block pruning for its own group of segments.
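
The two steps above could be sketched roughly as follows. This is not the actual TableDataMap.prune() code: the Segment class, the group() and prune() helpers, and the pruneOne callback are hypothetical stand-ins, and the sketch balances whole segments by file count rather than splitting a segment across threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelPruneSketch {
    // Hypothetical stand-in for a segment: an id plus its carbon file count.
    static final class Segment {
        final String id;
        final int fileCount;
        Segment(String id, int fileCount) { this.id = id; this.fileCount = fileCount; }
    }

    // Split segments into at most `threads` contiguous groups whose file
    // counts are roughly equal (target = ceil(totalFiles / threads)).
    static List<List<Segment>> group(List<Segment> segments, int threads) {
        long total = 0;
        for (Segment s : segments) total += s.fileCount;
        long target = Math.max(1, (total + threads - 1) / threads);
        List<List<Segment>> groups = new ArrayList<>();
        List<Segment> current = new ArrayList<>();
        long filled = 0;
        for (Segment s : segments) {
            // Close the current group once adding this segment would exceed
            // the target, unless only the last allowed group remains.
            if (!current.isEmpty() && filled + s.fileCount > target
                    && groups.size() < threads - 1) {
                groups.add(current);
                current = new ArrayList<>();
                filled = 0;
            }
            current.add(s);
            filled += s.fileCount;
        }
        if (!current.isEmpty()) groups.add(current);
        return groups;
    }

    // Run pruneOne over every segment, one task per group, and merge the
    // per-group results in submission order.
    static <R> List<R> prune(List<Segment> segments, int threads,
                             Function<Segment, List<R>> pruneOne) {
        List<List<Segment>> groups = group(segments, threads);
        List<R> result = new ArrayList<>();
        if (groups.isEmpty()) return result;
        ExecutorService pool = Executors.newFixedThreadPool(groups.size());
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<Segment> g : groups) {
                Callable<List<R>> task = () -> {
                    List<R> out = new ArrayList<>();
                    for (Segment s : g) out.addAll(pruneOne.apply(s));
                    return out;
                };
                futures.add(pool.submit(task));
            }
            for (Future<List<R>> f : futures) result.addAll(f.get());
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pruning interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("pruning failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the groups are contiguous and the futures are read back in the order they were submitted, the merged result keeps the original segment order even when tasks finish at different times.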

Thanks,
Ajantha

Re: [proposal] Parallelize block pruning of default datamap in driver for filter query processing.

xuchuanyin
'Parallelize pruning' has been in my plan for a long time; nice to see your
proposal here.

While implementing this, I'd like you to make it generic, so that not only
the default datamap but also other index datamaps can use parallel
pruning.



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [proposal] Parallelize block pruning of default datamap in driver for filter query processing.

Ajantha Bhat
@xuchuanyin
Yes, I will be handling this for all types of datamap pruning in the same
flow when I am done with default datamap's implementation and testing.

Thanks,
Ajantha



On Fri, Nov 23, 2018 at 6:36 AM xuchuanyin <[hidden email]> wrote:

> 'Parallelize pruning' is in my plan long time ago, nice to see your
> proposal
> here.
>
> While implementing this, I'd like you to make it common, that is to say not
> only default datamap but also other index datamaps can also use parallelize
> pruning.
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>

Re: [proposal] Parallelize block pruning of default datamap in driver for filter query processing.

ravipesala
+1, It will be helpful for pruning millions of data files in less time.
Please try to generalize for all datamaps.

Thanks & Regards
Ravindra

On Fri, 23 Nov 2018 at 10:24, Ajantha Bhat <[hidden email]> wrote:

> @xuchuanyin
> Yes, I will be handling this for all types of datamap pruning in the same
> flow when I am done with default datamap's implementation and testing.
>
> Thanks,
> Ajantha
>
>
>
> On Fri, Nov 23, 2018 at 6:36 AM xuchuanyin <[hidden email]> wrote:
>
> > 'Parallelize pruning' is in my plan long time ago, nice to see your
> > proposal
> > here.
> >
> > While implementing this, I'd like you to make it common, that is to say
> not
> > only default datamap but also other index datamaps can also use
> parallelize
> > pruning.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


--
Thanks & Regards,
Ravi

Re: [proposal] Parallelize block pruning of default datamap in driver for filter query processing.

xubo245
+1. Will parallelizing block pruning affect the SDK/CSDK reader? Please
check. The SDK and CSDK need to keep the carbon files'
sequence/order.


