http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Presto-Queries-leveraging-Secondary-Index-tp105291p105296.html
a. *Presto carbondata support reading bloom index*, so I want to correct
b. Between option1 and option2 the main difference is *option1 is
The performance of the option1 will be bad. Hence even though we need spark
c. For option2, the implementation you cannot do like bloom as we need to
level or blocklet level task distribution.
> Hi all,
>
> At present, Carbon table queries run through the Presto engine do not use
> indexes (SI, Bloom, etc.) during query processing. I am exploring feasible
> approaches that make use of secondary indexes (if any are available)
> without a query plan rewrite, similar to the existing datamap.
>
> *Option 1:* When Presto gets splits for the main table, find the suitable
> SI table, scan it, get the position references from the SI table, and
> return the splits for the main table accordingly.
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. Within CarbondataSplitManager.getSplits() for the main table, in
> CarbonInputFormat.getPrunedBlocklets(), prune with the new
> CoarseGrainIndex implementation for SI (similar to bloom). Inside
> Prune(), identify the most suitable SI table, use the SDK CarbonReader to
> scan it, and get the position references that match the predicate. We
> need to think about reading the table in multiple threads.
> 3. Modify the filter expression to append a positionId filter built from
> the position references read from the SI table.
> 4. In CarbondataPageSource, create the QueryModel with the modified
> filter expression.
> The rest of the processing remains the same as before.
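
Step 3 above (appending a positionId filter to the existing filter expression) could be sketched roughly as follows. This is only an illustration under assumptions: `SiFilterSketch`, `Expr`, `Leaf`, `And`, `In` and `appendPositionFilter` are hypothetical stand-ins and not CarbonData classes, and the position references are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are CarbonData classes. */
final class SiFilterSketch {

    /** Minimal stand-in for a node in a filter expression tree. */
    interface Expr {
        String render();
    }

    /** An opaque predicate, e.g. the filter the query already carries. */
    static final class Leaf implements Expr {
        private final String text;
        Leaf(String text) { this.text = text; }
        @Override public String render() { return text; }
    }

    /** Logical AND of two sub-expressions. */
    static final class And implements Expr {
        private final Expr left;
        private final Expr right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String render() {
            return "(" + left.render() + " AND " + right.render() + ")";
        }
    }

    /** Membership test of a column against a list of literal values. */
    static final class In implements Expr {
        private final String column;
        private final List<String> values;
        In(String column, List<String> values) {
            this.column = column;
            this.values = new ArrayList<>(values);
        }
        @Override public String render() {
            List<String> quoted = new ArrayList<>();
            for (String v : values) {
                quoted.add("'" + v + "'");
            }
            return column + " IN (" + String.join(", ", quoted) + ")";
        }
    }

    /**
     * Returns the original filter AND-ed with a positionId IN (...) filter.
     * Assumption: null means no SI table was consulted, so the filter is
     * returned unchanged; an empty list means the SI scan matched nothing.
     */
    static Expr appendPositionFilter(Expr original, List<String> positionRefs) {
        if (positionRefs == null) {
            return original;
        }
        return new And(original, new In("positionId", positionRefs));
    }
}
```

The point of the sketch is only that the original predicate is kept and narrowed, not replaced, so the main-table scan still re-checks the filter on the rows the SI points to.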
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
> 2. Can leverage SI with any execution engine.
> *Disadvantages:*
> 1. Reading the SI table within CarbondataSplitManager.getSplits() of the
> main table may degrade query performance. We need enough resources to
> spawn multiple reader threads within it.
>
> *Option 2:* Use the Index Server to prune (enable distributed pruning).
> Tentative Changes:
>
> 1. Make a new CoarseGrainIndex implementation for SI.
> 2. On the Index Server, during getSplits() for the main table, in
> DistributedPruneRDD.internalCompute() (i.e., on the Index Server
> executors), pruneIndexes() can prune with the new CoarseGrainIndex
> implementation for SI (similar to bloom). Inside Prune(), identify the
> most suitable SI table, use CarbonReader to read it, and get the position
> references that match the predicate.
> 3. Return the extended blocklets for the main table.
> 4. Need to check how to return or transform the filter expression so that
> the positionId filter, built from the position references read from the
> SI table, is passed from the Index Server to the driver along with the
> pruned blocklets?
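
For the open question in step 4, one possibility (purely an assumption, not an agreed design) is to ship the position references next to each pruned blocklet in a serializable carrier and let the driver build the positionId filter from them. In the sketch below, `SiPruneResultSketch`, `PrunedBlocklet`, `toBytes` and `fromBytes` are hypothetical names, and plain Java serialization stands in for whatever transport the Index Server actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the Index Server's actual result type. */
final class SiPruneResultSketch {

    /** A pruned blocklet id together with the SI position references. */
    static final class PrunedBlocklet implements Serializable {
        private static final long serialVersionUID = 1L;
        final String blockletId;
        final ArrayList<String> positionRefs;

        PrunedBlocklet(String blockletId, List<String> positionRefs) {
            this.blockletId = blockletId;
            this.positionRefs = new ArrayList<>(positionRefs);
        }
    }

    /** Serializes the carrier, as an executor would before replying. */
    static byte[] toBytes(PrunedBlocklet blocklet) {
        try (ByteArrayOutputStream buf = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(blocklet);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deserializes the carrier, as the driver would on receipt. */
    static PrunedBlocklet fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (PrunedBlocklet) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this shape the filter expression itself never has to travel back from the Index Server; only the data needed to rebuild it does. The payload size would grow with the number of matching rows, which would need to be measured.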
> *Advantages:*
> 1. Can avoid the query plan rewrite and yet make use of SI tables.
> *Disadvantages:*
> 1. Index Server executor memory would be occupied by SI table reading.
> 2. Concurrent queries may be affected, as the Index Server is used for SI
> table reading.
> 3. The Index Server must be running.
>
> We can introduce a new Carbon property to switch between the present
> approach and the newly proposed one. We may consider changing the
> secondary index table storage file format later.
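
The property switch mentioned above might look something like this. It is a minimal sketch under assumptions: the key `carbon.presto.si.pruning.mode`, the `Mode` values and the `SiPruningSwitch` class are all hypothetical, and `java.util.Properties` stands in for the real Carbon property store.

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical sketch of a switch between the present and new approach. */
final class SiPruningSwitch {

    /** Hypothetical property key; not an existing Carbon property. */
    static final String KEY = "carbon.presto.si.pruning.mode";

    /** DISABLED keeps the present behaviour; the others map to Option 1 and 2. */
    enum Mode { DISABLED, SPLIT_MANAGER, INDEX_SERVER }

    /** Resolves the mode, falling back to DISABLED when unset or invalid. */
    static Mode resolve(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return Mode.DISABLED;
        }
        try {
            return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Unknown value: stay on the present approach rather than fail.
            return Mode.DISABLED;
        }
    }
}
```

Defaulting to the present behaviour means existing deployments are unaffected unless the property is set explicitly.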
>
> Please let me know your opinion or suggestions: should we go with
> Option 1, Option 2, both, or some other approach?
>
>
> Thanks,
> Venu Reddy
>
>
>