Apache CarbonData Dev Mailing List archive

[Discussion]Presto Queries leveraging Secondary Index

VenuReddy

[Discussion]Presto Queries leveraging Secondary Index

Hi all,

At present, queries on Carbon tables through the Presto engine do not make
use of indexes (SI, Bloom, etc.) in query processing. I am exploring feasible
approaches that make use of secondary indexes (if any are available) without
a query plan rewrite, similar to the existing datamap.

*Option 1:* While Presto gets the splits for the main table, find a suitable
SI table, scan it, get the position references from it, and return the splits
for the main table accordingly.
Tentative changes:

1. Add a new CoarseGrainIndex implementation for SI.
2. Within CarbondataSplitManager.getSplits() for the main table, in
CarbonInputFormat.getPrunedBlocklets(), prune with the new CoarseGrainIndex
implementation for SI (similar to bloom). Inside prune(), identify the most
suitable SI table, scan it with the SDK CarbonReader, and get the position
references of the rows matching the predicate. Reading the table in multiple
threads still needs to be thought through.
3. Modify the filter expression to append a positionId filter built from the
position references read from the SI table.
4. In CarbondataPageSource, create the QueryModel with the modified filter
expression.
The rest of the processing remains the same as before.
*Advantages:*
1. Avoids the query plan rewrite and still makes use of SI tables.
2. SI can be leveraged with any execution engine.
*Disadvantages:*
1. Reading the SI table inside CarbondataSplitManager.getSplits() for the
main table may degrade query performance. Enough resources are needed to
spawn multiple reader threads within it.
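
The Option 1 flow can be modelled roughly as follows. This is only a minimal
sketch in plain Python and not CarbonData code: PositionRef,
scan_si_segment, prune_with_si and append_position_filter are hypothetical
names, the SI table is a list of in-memory rows, and the filter is a plain
string. It shows SI segments being scanned in several threads, the
main-table blocklets being pruned to the referenced ones, and a positionId
condition being appended to the original filter.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, NamedTuple, Set, Tuple


class PositionRef(NamedTuple):
    """Hypothetical position reference pointing at one main-table blocklet."""
    segment: str
    block: str
    blocklet: int


# Toy SI data: each SI segment is a list of (indexed value, position reference).
SiSegment = List[Tuple[str, PositionRef]]


def scan_si_segment(rows: SiSegment, predicate: Callable[[str], bool]) -> Set[PositionRef]:
    """Scan one SI segment and collect the references of matching rows."""
    return {ref for value, ref in rows if predicate(value)}


def prune_with_si(si_segments, predicate, main_blocklets, threads=4):
    """Scan SI segments in parallel and keep only the referenced blocklets."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partial = list(pool.map(lambda seg: scan_si_segment(seg, predicate), si_segments))
    matched = set().union(*partial)
    pruned = [b for b in main_blocklets if b in matched]
    return pruned, matched


def append_position_filter(original_filter: str, refs) -> str:
    """Append a positionId condition (toy string form) to the original filter."""
    if not refs:
        # No SI match means no main-table row can satisfy the predicate.
        return "FALSE"
    ids = ", ".join(f"'{r.segment}/{r.block}/{r.blocklet}'" for r in sorted(refs))
    return f"({original_filter}) AND positionId IN ({ids})"
```

The thread pool here stands in for the "multiple threads" concern in step 2:
all SI reading happens inside the split-planning call, which is exactly where
the performance risk listed above comes from.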

*Option 2:* Use the Index Server to prune (enable distributed pruning).
Tentative changes:

1. Add a new CoarseGrainIndex implementation for SI.
2. On the Index Server, during getSplits() for the main table, within
pruneIndexes() in the context of DistributedPruneRDD.internalCompute() (i.e.,
on the Index Server executors), prune with the new CoarseGrainIndex
implementation for SI (similar to bloom). Inside prune(), identify the most
suitable SI table, read it with CarbonReader, and get the position references
of the rows matching the predicate.
3. Return the extended blocklets for the main table.
4. Open question: how do we return the position references read from the SI
table from the Index Server to the driver, along with the pruned blocklets,
so that the filter expression can be transformed to append the positionId
filter?
*Advantages:*
1. Avoids the query plan rewrite and still makes use of SI tables.
*Disadvantages:*
1. Index Server executor memory would be occupied by SI table reading.
2. Concurrent queries may be impacted, as the Index Server is used for SI
table reading.
3. The Index Server must be running.
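
One possible shape for the open question in step 4 is to let each pruned
blocklet carry its own position references back to the driver. The sketch
below is only an illustration under that assumption: PrunedBlocklet,
encode_response, decode_response and collect_position_refs are hypothetical
names, JSON stands in for whatever serialization the Index Server actually
uses, and the reference strings are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PrunedBlocklet:
    """Hypothetical pruned blocklet plus the SI position references inside it."""
    segment: str
    block: str
    blocklet: int
    position_refs: List[str] = field(default_factory=list)


def encode_response(blocklets: List[PrunedBlocklet]) -> str:
    """Executor side: serialize the blocklets together with their references."""
    return json.dumps([asdict(b) for b in blocklets])


def decode_response(payload: str) -> List[PrunedBlocklet]:
    """Driver side: restore the blocklets from the executor response."""
    return [PrunedBlocklet(**item) for item in json.loads(payload)]


def collect_position_refs(blocklets: List[PrunedBlocklet]) -> List[str]:
    """Driver side: gather all references so the filter can be extended."""
    return sorted({ref for b in blocklets for ref in b.position_refs})
```

With this shape the driver needs no second round trip: the same response that
carries the pruned blocklets also carries what is needed to build the
positionId filter, at the cost of a larger payload.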

We can introduce a new Carbon property to switch between the present
approach and the new one being proposed. We may consider changing the
storage file format of the secondary index table later.

Please let me know your opinion: should we go with Option 1, Option 2, or
both, or do you have any other suggestion?

Thanks,
Venu Reddy

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Ajantha Bhat

Re: [Discussion]Presto Queries leveraging Secondary Index

Hi Venu,

a. *Presto carbondata supports reading the bloom index*, so I want to
correct your initial statement that the Presto engine does not make use of
indexes (SI, Bloom, etc.) in query processing.

b. The main difference between option 1 and option 2 is that *option 1 is
multi-threaded and option 2 is distributed.*
The performance of option 1 will be bad. Hence, even though we need a Spark
index server cluster (currently presto carbondata always needs a Spark
cluster to write carbondata), *I want to go with option 2.*

c. For option 2, the implementation cannot work like bloom, because we need
to read the whole SI table with the filter. I suggest making a dataframe by
querying the SI table (which calls CarbonScanRDD); once you get the matched
blocklets, make the splits for the main table from them, based on block-level
or blocklet-level task distribution.
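
The difference between the two task-distribution levels in point c can be
sketched as below. This is a toy model, not CarbonData code: make_splits is
a hypothetical helper, and a matched blocklet is represented simply as a
(segment, block, blocklet id) tuple.

```python
from collections import defaultdict


def make_splits(matched_blocklets, level="blocklet"):
    """Turn matched (segment, block, blocklet_id) tuples into main-table splits."""
    ordered = sorted(matched_blocklets)
    if level == "blocklet":
        # One split (task) per matched blocklet.
        return [{"segment": s, "block": b, "blocklets": [bl]} for s, b, bl in ordered]
    if level == "block":
        # One split (task) per block, listing the matched blocklets inside it.
        grouped = defaultdict(list)
        for s, b, bl in ordered:
            grouped[(s, b)].append(bl)
        return [{"segment": s, "block": b, "blocklets": ids} for (s, b), ids in grouped.items()]
    raise ValueError(f"unknown distribution level: {level}")
```

Blocklet-level distribution gives more, smaller tasks; block-level
distribution gives fewer tasks that each read several blocklets of one file.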

Thanks,
Ajantha

In reply to VenuReddy's post of Tue, Jan 5, 2021 at 5:31 PM.

David CaiQiang

Re: [Discussion]Presto Queries leveraging Secondary Index

Hi Venu and Ajantha,

I also have some suggestions for the new SI solution.
1. I agree with avoiding the query plan rewrite.
2. Push the SI filter down to the pruning step of the main table directly on
the driver side, but we need a distributed job to improve performance.
3. Segment-level usability:
for example, when only one segment lacks indexes but the other 99 segments
have them, SI should still be used to improve filter queries on the index
column.
4. Consider the filter column's selectivity; it should affect the priority
of the indexes (including the main index).
phase 1: based on rules (filter order or hint)
phase 2: based on cost (statistics)
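
Points 3 and 4 (phase 1) can be illustrated with the small sketch below. It
is a toy model only: plan_segments and choose_index are hypothetical names,
segments and columns are plain strings, and the rule shown (honor a usable
hint, otherwise follow filter order) is just one possible reading of "filter
order or hint".

```python
def plan_segments(segments, indexed_segments):
    """Split segments into those prunable via SI and those needing default pruning."""
    indexed = set(indexed_segments)
    use_si = [s for s in segments if s in indexed]
    default = [s for s in segments if s not in indexed]
    return use_si, default


def choose_index(filter_columns, index_columns, hint=None):
    """Rule-based choice: a usable hint wins, else the first indexed filter column."""
    indexed = set(index_columns)
    if hint is not None and hint in indexed and hint in filter_columns:
        return hint
    for column in filter_columns:
        if column in indexed:
            return column
    return None  # no usable index, fall back to the main index only
```

A cost-based phase 2 would replace the body of choose_index with a decision
driven by column statistics, while plan_segments could stay as it is.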

-----
Best Regards
David Cai

akashrn5

Re: [Discussion]Presto Queries leveraging Secondary Index

In reply to this post by VenuReddy

Hi Venu,

Thanks for the suggestion.

1. Option 1 is not a good idea; I think performance will be bad.
2. For option 2, we have other indexes such as lucene and bloom where
distributed pruning happens. Lucene is also an index stored along with the
table, not a separate table like SI, so we scan lucene in a distributed job
and then return the index for the filter expression. Similarly, we can scan
and prune with SI, but since that needs a Spark job, the index server is the
only option.
We can use it for scanning, but I am afraid it may impact other concurrent
queries. I suggest doing a POC with the index server first; it will reveal
other bottlenecks in this approach, and then we can decide and start the
design.

If you have already done a POC and have results and a design ready, we can
review them.

Thanks

Regards
Akash


kunalkapoor

Re: [Discussion]Presto Queries leveraging Secondary Index

+1 on using the index server to leverage SI. As discussed earlier, we would
need a segment UDF to enable selective segment reading instead of the
current implementation. The existing setSegmentsToRead API should also be
removed later.

Please share the design after your POC.

In reply to akashrn5's post of Mon, Jan 18, 2021 at 9:42 AM.

VenuReddy

Re: [Discussion]Presto Queries leveraging Secondary Index

In reply to this post by VenuReddy

Hi all,

As discussed in the community meeting held in the last week of Feb 2021, we
already plan to make Secondary Index a Coarse Grain (CG) Datamap in the
future. For this requirement, it would be more appropriate to implement
Secondary Index as a CG Datamap. Presto queries can then leverage the
secondary index during pruning through the datamap interface. Spark queries
can continue to use secondary indexes through the existing approach of query
plan modification.

I have added the detailed design in the doc below.

https://docs.google.com/document/d/1VZlRYqydjzBXmZcFLQ4Ty-lK8RQlYVDoEfIId7vOaxk/edit?usp=sharing

Please review it and let me know your suggestions/inputs.

Thanks,
Venu Reddy


kunalkapoor

Re: [Discussion]Presto Queries leveraging Secondary Index

+1 for the design

In reply to VenuReddy's post of Tue, Mar 23, 2021 at 10:37 AM.

Indhumathi

Re: [Discussion]Presto Queries leveraging Secondary Index

In reply to this post by VenuReddy

+1 for the design.

Please find my comments below.

1. When updating the IndexStatus.ENABLED property, compatibility scenarios
need to be considered as well.
2. Please also update the query behavior for the case where
carbon.enable.distributed.index and carbon.disable.index.server.fallback
are enabled.
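
The interaction of the two properties in comment 2 might look like the
sketch below. It is an assumption-laden toy, not CarbonData behavior taken
from the design doc: the property names are the ones quoted above, but the
default values, the prune function, the IndexServerUnavailable exception,
and the two pruning callbacks are all hypothetical.

```python
class IndexServerUnavailable(Exception):
    """Hypothetical error raised when the index server cannot serve a request."""


def prune(filter_expr, props, index_server_prune, local_prune):
    """Pick the pruning path from the two properties (defaults are assumptions)."""
    distributed = props.get("carbon.enable.distributed.index", "false") == "true"
    no_fallback = props.get("carbon.disable.index.server.fallback", "false") == "true"
    if not distributed:
        return local_prune(filter_expr)
    try:
        return index_server_prune(filter_expr)
    except IndexServerUnavailable:
        if no_fallback:
            raise  # fallback disabled: surface the failure to the query
        return local_prune(filter_expr)
```

The case the comment singles out is the last branch: with both properties
enabled, a query that cannot reach the index server fails instead of quietly
pruning on the driver without SI.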

Regards,
Indhumathi M


Ajantha Bhat

Re: [Discussion]Presto Queries leveraging Secondary Index

+1

Thanks,
Ajantha

In reply to Indhumathi's post of Mon, Mar 29, 2021 at 5:58 PM.

akashrn5

Re: [Discussion]Presto Queries leveraging Secondary Index

In reply to this post by VenuReddy

Hi,

+1 for the feature and the design.

I have given some comments on the design doc about handling some missing
scenarios and a few small changes.
Can you please update the design doc? Since the comments are not major,
except for one or two, you can go ahead with the feature and address the
comments in parallel.

Thanks

Regards,
Akash R
