> Parallelize block pruning of default datamap in driver for filter query processing
> -----------------------------------------------------------------------------------
>
> Key: CARBONDATA-3118
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3118
> Project: CarbonData
> Issue Type: Improvement
> Reporter: Ajantha Bhat
> Assignee: Ajantha Bhat
> Priority: Major
> Fix For: 1.5.1
>
> Time Spent: 7.5h
> Remaining Estimate: 0h
>
> *"Parallelize block pruning of default datamap in driver
> for filter query processing"*
> *Background:*
> Block pruning for filter queries is done on the driver side.
> In real-world big data scenarios, a single carbon table can have millions of
> carbon files.
> Currently, block pruning for 1 million carbon files takes around 5
> seconds, because each carbon file takes around 0.005 ms to prune
> (with only one filter column set in the 'column_meta_cache' tblproperty).
> With more files, block pruning can take even longer.
> Also, the Spark job is not launched until block pruning completes, so
> the user cannot tell what is happening during that time or why the Spark
> job has not launched.
> Block pruning is slow today because segments are processed
> sequentially. We can reduce the time by parallelizing it.
> *Solution:*
> Keep the default number of threads for block pruning at 4.
> The user can reduce this number through the carbon property
> "carbon.max.driver.threads.for.pruning", which accepts values from 1 to 4.
> In TableDataMap.prune(), group the segments according to the number of
> threads, distributing an equal number of carbon files to each thread.
> Launch a thread for each group of segments to handle block pruning.
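
The grouping step described above could be sketched roughly as follows. This is only an illustration, not the code in TableDataMap.prune(): the class PruningSketch and its methods are hypothetical, segments are reduced to plain file counts, whole segments are kept together (the actual change may split them differently), and falling back to 4 for an out-of-range property value is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names come from the CarbonData code base.
public class PruningSketch {
    static final int DEFAULT_THREADS = 4;

    // Interpret the value of "carbon.max.driver.threads.for.pruning".
    // Assumption: anything missing, unparsable, or outside 1..4 falls back to 4.
    static int resolveThreadCount(String configured) {
        if (configured == null) {
            return DEFAULT_THREADS;
        }
        try {
            int n = Integer.parseInt(configured.trim());
            return (n >= 1 && n <= DEFAULT_THREADS) ? n : DEFAULT_THREADS;
        } catch (NumberFormatException e) {
            return DEFAULT_THREADS;
        }
    }

    // Group segment indices so each thread gets a similar number of carbon files.
    // Greedy heuristic: take segments largest first, give each to the group
    // with the fewest files so far.
    static List<List<Integer>> groupSegments(int[] filesPerSegment, int threads) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            groups.add(new ArrayList<>());
        }
        long[] load = new long[threads];
        Integer[] order = new Integer[filesPerSegment.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Integer.compare(filesPerSegment[b], filesPerSegment[a]));
        for (int seg : order) {
            int min = 0;
            for (int t = 1; t < threads; t++) {
                if (load[t] < load[min]) {
                    min = t;
                }
            }
            groups.get(min).add(seg);
            load[min] += filesPerSegment[seg];
        }
        return groups;
    }

    // Launch one task per group and return the total number of files visited.
    // Summing file counts is a stand-in for the real per-file pruning work.
    static long pruneInParallel(int[] filesPerSegment, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (List<Integer> group : groupSegments(filesPerSegment, threads)) {
                Callable<Long> task = () -> {
                    long visited = 0;
                    for (int seg : group) {
                        visited += filesPerSegment[seg];
                    }
                    return visited;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pruning task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, with segments holding 5, 3, 3 and 1 files and 2 threads, groupSegments places segments 0 and 3 in one group and segments 1 and 2 in the other, so each thread handles 6 files. Keeping segments whole is simple but cannot balance a table dominated by one very large segment; splitting at file level would be needed for that case.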