Apache CarbonData Dev Mailing List archive

Re: [Discussion]Presto Queries leveraging Secondary Index

Posted by Ajantha Bhat on Jan 05, 2021; 2:15pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-tp105291p105296.html

Hi Venu,

a. *Presto carbondata support reading bloom index*, so I want to correct
your initial statement "Presto engine do not make use of
indexes(SI, Bloom etc) in query processing"

b. Between option1 and option2 the main difference is *option1 is
multi-threaded and option2 is distributed.*
The performance of the option1 will be bad. Hence even though we need spark
index server cluster (currently presto carbondata always need spark cluster
to write carbondata) *I want to go with option2.*

c. For option2, the implementation you cannot do like bloom as we need to
read the whole SI table with filter, so suggest to make a dataframe by
querying the SI table (which calls CarbonScanRDD) and once you get the
matched blocklets, make a split for main table from that based on block
level or blocklet level task distribution.

Thanks,
Ajantha

On Tue, Jan 5, 2021 at 5:31 PM VenuReddy <[hidden email]> wrote:

> Hi all.!
>
> At present Carbon table queries with Presto engine do not make use of
> indexes(SI, Bloom etc) in query processing. Exploring feasible approaches
> without query plan rewrite to make use of secondary indexes(if any
> available) similar to that of existing datamap.
>
> *
> Option 1:
> * Presto get splits for main table to find the suitable SI table, scan, get
> the position references from SI table and return the splits for main table
> accordingly.
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. Within context of CarbondataSplitManager.getSplits() for main table, in
> CarbonInputFormat.getPrunedBlocklets(), we can do prune with new
> CoarseGrainIndex implementation for SI(similar to that of bloom). Inside
> Prune(), Identify the best suitable SI table, Use SDK CarbonReader to scan
> the identified SI table, get the position references to matching predicate.
> Need to think of reading the table in multiple threads.
> 3. Modify the filter expression to append positionId filter with obtained
> position references from SI table read.
> 4. In the context of CarbondataPageSource, create QueryModel with modified
> filter expression.
> Rest of the processing remains same as before.
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
> 2. Can leverage SI with any execution engine.
> *DisAdvantages:*
> 1. Reading SI table in the context of CarbondataSplitManager.getSplits() of
> main table, possibly may degrade the query performance. Need to have enough
> resource to spawn multiple threads for reading within it.
>
> *
> Option 2:
> * Use Index Server to prune(enable distributed pruning).
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. On Index Server, during getSplits() for main table, in the context of
> DistributedPruneRDD.internalCompute()(i.e., on Index server executors)
> within pruneIndexes() can do prune with new CoarseGrainIndex implementation
> for SI(similar to that of bloom). Inside Prune(), Identify the best
> suitable
> SI table, Use CarbonReader to read the SI table, get the position
> references
> to matching predicate.
> 3. Return the extended blocklets for main table
> 4. Need to check how to return/transform filter expression to append
> positionId filter with position references which are read from SI table
> from
> Index Server to Driver along with pruned blocklets??
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
> *DisAdvantages:*
> 1. Index Server Executors memory would be occupied for SI table reading.
> 2. Concurrent queries may have impact as Index server is used for SI table
> reading.
> 3. Index Server must be running.
>
> We can introduce a new Carbon property to switch between present and the
> new
> approach being proposed. We may consider the secondary index table storage
> file format change later.
>
> Please let me know your opinion/suggestion if we can go with Option-1 or
> Option-2 or both Option 1 + 2 or any other suggestion ?
>
>
> Thanks,
> Venu Reddy
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>