Re: Questions and Concerns on DataMap API

Posted by sounak on Oct 04, 2017; 11:02am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Questions-and-Concerns-on-DataMap-API-tp23504p23505.html

Hi Dong,

My answers are inline below.

> - DataMap only supports partition-level pruning (more precisely, blocklet-level pruning). Specifically, the `prune()` function in the DataMap interface consumes a filter and produces a list of Blocklets. It therefore seems that building a sophisticated data structure may not be useful. For example, in a spatial range query, the only thing I may want to know is the boundary of a blocklet; any other insight will not be exposed to the pruning procedure.
      If you are developing your own DataMap for spatial data, you can override the pruning logic with your own. During the write phase of your DataMap, your secondary index can store the spatial data together with its corresponding blocklet. Later, while pruning, it filters the spatial data and returns the matching blocklet IDs. Those blocklet IDs are then fed to BlockletDataMap (the default DataMap) to retrieve the detailed blocklet information.
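
      To make the write-phase / prune-phase split concrete, here is a minimal sketch. It assumes the simplest possible spatial summary (one bounding box per blocklet) and uses only hypothetical names: `SpatialBlockletIndex`, `Box`, `onRow` and the string blocklet IDs are illustrations, not CarbonData classes or the actual DataMap interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a spatial secondary index.
// None of these types come from CarbonData.
class SpatialBlockletIndex {

  /** Axis-aligned bounding box that grows as points are added. */
  static final class Box {
    double minX, minY, maxX, maxY;

    Box(double minX, double minY, double maxX, double maxY) {
      this.minX = minX;
      this.minY = minY;
      this.maxX = maxX;
      this.maxY = maxY;
    }

    void expand(double x, double y) {
      minX = Math.min(minX, x);
      minY = Math.min(minY, y);
      maxX = Math.max(maxX, x);
      maxY = Math.max(maxY, y);
    }

    boolean intersects(Box other) {
      return minX <= other.maxX && other.minX <= maxX
          && minY <= other.maxY && other.minY <= maxY;
    }
  }

  // One bounding box per blocklet ID, in insertion order.
  private final Map<String, Box> boxes = new LinkedHashMap<>();

  /** Write phase: record each point against the blocklet it is written to. */
  void onRow(String blockletId, double x, double y) {
    Box box = boxes.get(blockletId);
    if (box == null) {
      boxes.put(blockletId, new Box(x, y, x, y));
    } else {
      box.expand(x, y);
    }
  }

  /** Prune phase: keep only blocklets whose box intersects the query range. */
  List<String> prune(Box query) {
    List<String> hits = new ArrayList<>();
    for (Map.Entry<String, Box> entry : boxes.entrySet()) {
      if (entry.getValue().intersects(query)) {
        hits.add(entry.getKey());
      }
    }
    return hits;
  }
}
```

      In a real DataMap, the IDs returned by `prune` would be the ones handed to BlockletDataMap for the detailed lookup. A richer structure (an R-tree, for example) could replace the per-blocklet box without changing that contract.
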
> - In most commercial databases, only one index is used for the filter process, even though other secondary indexes could also be used to prune. Most likely, the query optimizer will choose the index that provides the highest selectivity.
      Yes, most commercial database optimizers choose the best access path (when multiple indexes are present) based on index coverage and statistics. Indexes that completely cover the projection and the predicates are preferred. CarbonData does not have this feature yet, and it would be good to have.
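
      As an illustration only, the selection rule described above could look like the sketch below. Everything in it is hypothetical: `IndexChooser`, `Candidate`, and the `matchFraction` estimate are invented for the example, CarbonData has no such component today, and ranking coverage ahead of selectivity is an assumption rather than a documented optimizer policy.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical access-path chooser; not part of CarbonData.
class IndexChooser {

  static final class Candidate {
    final String name;
    final boolean coversQuery;   // covers both projection and predicates
    final double matchFraction;  // estimated fraction of rows kept; lower is more selective

    Candidate(String name, boolean coversQuery, double matchFraction) {
      this.name = name;
      this.coversQuery = coversQuery;
      this.matchFraction = matchFraction;
    }
  }

  /** Prefer covering indexes, then the one expected to keep the fewest rows. */
  static Optional<Candidate> choose(List<Candidate> candidates) {
    return candidates.stream()
        .min(Comparator
            // Boolean false sorts before true, so covering indexes come first.
            .comparing((Candidate c) -> !c.coversQuery)
            .thenComparingDouble(c -> c.matchFraction));
  }
}
```

      The `matchFraction` values would have to come from statistics that the index maintains. How those would be collected is left open here.
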
> - I am confused about the semantics of the `toDistribute()` function in the DataMap API. One problem I found in the MinMax DataMap example is that a single thread consumes all these indexes and then constructs the pruning. As a result, we may lose any advantage of massive parallelism. Is `distributed datamap` supposed to solve this problem?
      The MinMax example applies the DataMap only on the driver side, not in the executors. It is there purely as an example. If you want your DataMap to be distributed and executed in the executors, it can be.
> - Finally, could you give me an example of iterating through all rows in a blocklet, block, and segment so that I can get my input for index bulk loading? In one of the DataMap examples, which is still in a pull request (https://github.com/apache/carbondata/pull/1359), I cannot find this part, since it pulls the statistics directly from the MinMax index built into the Blocklet. (Please refer to the `loadBlockDetails` and `constructMinMaxIndex` functions in `MinMaxDataWriter.java` under that PR.)
      I will shortly update another pull request that scans data from the FactFile (i.e. the carbondata file) and updates the DataMap secondary index. I will share that PR with you once it is ready.


Thanks
Sounak


> On 03-Oct-2017, at 10:42 PM, Dong Xie <[hidden email]> wrote:
>
> Hi Carbon Team,
>
> I have recently been considering implementing a secondary index on top of the DataMap API. After a careful look at the design, I have some questions and concerns to raise here:
>
> - DataMap only supports partition-level pruning (more precisely, blocklet-level pruning). Specifically, the `prune()` function in the DataMap interface consumes a filter and produces a list of Blocklets. It therefore seems that building a sophisticated data structure may not be useful. For example, in a spatial range query, the only thing I may want to know is the boundary of a blocklet; any other insight will not be exposed to the pruning procedure.
> - In most commercial databases, only one index is used for the filter process, even though other secondary indexes could also be used to prune. Most likely, the query optimizer will choose the index that provides the highest selectivity.
> - I am confused about the semantics of the `toDistribute()` function in the DataMap API. One problem I found in the MinMax DataMap example is that a single thread consumes all these indexes and then constructs the pruning. As a result, we may lose any advantage of massive parallelism. Is `distributed datamap` supposed to solve this problem?
> - Finally, could you give me an example of iterating through all rows in a blocklet, block, and segment so that I can get my input for index bulk loading? In one of the DataMap examples, which is still in a pull request (https://github.com/apache/carbondata/pull/1359), I cannot find this part, since it pulls the statistics directly from the MinMax index built into the Blocklet. (Please refer to the `loadBlockDetails` and `constructMinMaxIndex` functions in `MinMaxDataWriter.java` under that PR.)
>
>
> Thanks,
> Dong