Apache CarbonData Dev Mailing List archive - [GitHub] [carbondata] VenuReddy2103 opened a new pull request #4110: [WIP]Secondary Index as a coarse grain datamap

Posted by GitBox on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/GitHub-carbondata-VenuReddy2103-opened-a-new-pull-request-4110-WIP-Secondary-Index-as-a-coarse-grainp-tp107076.html

VenuReddy2103 opened a new pull request #4110:
URL: https://github.com/apache/carbondata/pull/4110

### Why is this PR needed?
At present, secondary indexes are used for query pruning by modifying the Spark plan. This approach is tightly coupled to Spark because the plan modification is specific to the Spark engine. For Presto or Hive queries, it is not feasible to modify the query plans in the same way. Hence the need for an engine-agnostic approach to using secondary indexes in query pruning.

### What changes were proposed in this PR?
1. Added Secondary Index datamap as a coarse grain datamap
2. The secondary index datamap's prune step fires a Spark SQL query on the identified secondary index table, within the particular segment, to get the position references that match the datamap's filters, and forms the blocklets from them. Note: a Spark SQL query is used to take advantage of Spark's distributed computing, so that the secondary index table is filtered and read in a distributed manner. Because the secondary index datamap fires Spark SQL, distributed pruning must be enabled and the Index Server must be up and running.
3. Added a CarbonInputFormat-level property to control whether the newly added secondary index datamap is used in query pruning. This property is set only when the query is triggered from Presto, so the secondary index datamap is used only for Presto queries. Queries from Spark continue to use the existing approach of plan modification at the optimizer/execution phases.
4. When the Index Server gets splits, if secondary index pruning is applicable, it prunes and gets the extended blocklets directly on the Index Server driver instead of using the existing DistributedPruneRDD, which prunes on the Index Server executors. This is because secondary index datamap pruning essentially fires a Spark SQL query, which requires a Spark session/context.
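
The flow in the list above can be pictured with a minimal Java sketch. It is not code from this PR. The class name, the property key `example.input.secondary.index.prune`, and the `<blockPath>#<blockletId>#<rowId>` position reference layout are all assumptions made up for illustration; CarbonData's real formats and APIs may differ. The sketch only shows the coarse-grain idea: row-level position references returned by the index query are collapsed into the set of blocklets that need scanning, behind a property gate, with the pruning location chosen on the Index Server.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class SecondaryIndexPruneSketch {

    // Hypothetical property key; the PR description does not name the real one.
    public static final String USE_SI_DATAMAP = "example.input.secondary.index.prune";

    // Coarse-grain step: collapse row-level position references into
    // a map of block path -> blocklet ids that must be scanned.
    public static Map<String, TreeSet<Integer>> toBlocklets(List<String> positionReferences) {
        Map<String, TreeSet<Integer>> blocklets = new TreeMap<>();
        for (String ref : positionReferences) {
            // Assumed layout: <blockPath>#<blockletId>#<rowId>
            String[] parts = ref.split("#");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Unexpected position reference: " + ref);
            }
            blocklets.computeIfAbsent(parts[0], k -> new TreeSet<>())
                     .add(Integer.parseInt(parts[1]));
        }
        return blocklets;
    }

    // Item 3: the datamap is used only when the (hypothetical) property is set to true.
    public static boolean useSecondaryIndexDatamap(Map<String, String> jobConf) {
        return Boolean.parseBoolean(jobConf.getOrDefault(USE_SI_DATAMAP, "false"));
    }

    // Item 4: secondary index pruning needs a Spark session, so it runs on the driver.
    public static String pruneLocation(boolean secondaryIndexApplicable) {
        return secondaryIndexApplicable ? "driver" : "executors";
    }

    public static void main(String[] args) {
        // Stand-in for the rows a Spark SQL query on the index table might return.
        List<String> refs = Arrays.asList(
            "part-0.carbondata#0#12",
            "part-0.carbondata#0#40",
            "part-0.carbondata#2#7",
            "part-1.carbondata#1#3");
        System.out.println(toBlocklets(refs));
        // prints {part-0.carbondata=[0, 2], part-1.carbondata=[1]}
    }
}
```

Two matching rows in the same blocklet yield a single blocklet entry, which is what makes the index coarse grain: it narrows the scan to blocklets rather than to individual rows.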

### Does this PR introduce any user interface change?
- No
- Yes. (please explain the change and update document)

### Is any new testcase added?
- No
- Yes
