Apache CarbonData Dev Mailing List archive

Questions and Concerns on DataMap API

skyprophet

Oct 03, 2017; 5:12pm

Questions and Concerns on DataMap API

Hi Carbon Team,

I am considering implementing a secondary index on top of the DataMap API. After a careful look at the design, I have some questions and concerns I want to raise here:

- DataMap only supports partition-level (more precisely, blocklet-level) pruning. Specifically, the `prune()` function in the DataMap interface consumes a filter and produces a list of Blocklets. It therefore seems that building a sophisticated data structure may not be useful. For example, for a spatial range query, the only thing I need to know is the boundary of each blocklet; any other insight is not exposed to the pruning procedure.
- It is common in most commercial databases that only one index is used for filtering, even when other secondary indexes could also prune. Most likely, the query optimizer chooses the index that provides the highest selectivity.
- I am confused about the semantics of the `toDistribute()` function in the DataMap API. One problem I found in the MinMax DataMap example is that a single thread consumes all these indexes and then performs the pruning. As a result, we may lose any advantage of massive parallelism. Is the `distributed datamap` supposed to solve this problem?
- Finally, could you give me an example of iterating through all rows in a blocklet, block, and segment so that I can get the input for index bulk loading? In one of the DataMap examples, which is still in a pull request (https://github.com/apache/carbondata/pull/1359), I cannot find this part, since it pulls the statistics directly from the MinMax index built into the Blocklet. (Please refer to the `loadBlockDetails` and `constructMinMaxIndex` functions in `MinMaxDataWriter.java` under that PR.)

Thanks,
Dong
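
To make the first concern concrete, here is a minimal sketch of blocklet-level pruning for a spatial range query. It is not CarbonData's DataMap API: `CoarseSpatialPruner`, `BlockletBounds`, and the `prune` signature are hypothetical names used only for illustration. The index keeps one bounding box per blocklet, so pruning can only answer which blocklets might contain matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these types are CarbonData classes.
final class CoarseSpatialPruner {

    // One bounding box per blocklet: the only information the index holds.
    static final class BlockletBounds {
        final int blockletId;
        final double minX, minY, maxX, maxY;

        BlockletBounds(int blockletId, double minX, double minY, double maxX, double maxY) {
            this.blockletId = blockletId;
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        // True when this bounding box overlaps the query rectangle.
        boolean intersects(double qMinX, double qMinY, double qMaxX, double qMaxY) {
            return minX <= qMaxX && maxX >= qMinX && minY <= qMaxY && maxY >= qMinY;
        }
    }

    // Returns the ids of blocklets that may contain points inside the query rectangle.
    static List<Integer> prune(List<BlockletBounds> index,
                               double qMinX, double qMinY, double qMaxX, double qMaxY) {
        List<Integer> hits = new ArrayList<>();
        for (BlockletBounds b : index) {
            if (b.intersects(qMinX, qMinY, qMaxX, qMaxY)) {
                hits.add(b.blockletId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<BlockletBounds> index = Arrays.asList(
            new BlockletBounds(0, 0, 0, 10, 10),
            new BlockletBounds(1, 20, 20, 30, 30),
            new BlockletBounds(2, 5, 5, 25, 25));
        System.out.println(prune(index, 8, 8, 12, 12)); // prints [0, 2]
    }
}
```

With the sample data in `main`, the query rectangle (8, 8)-(12, 12) keeps blocklets 0 and 2 and drops blocklet 1. Nothing finer than a blocklet id leaves `prune`, which is the limitation raised above.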

sounak

Oct 04, 2017; 11:02am

Re: Questions and Concerns on DataMap API

Hi Dong,

My answers are inline below.

> - DataMap only supports partition-level (more precisely, blocklet-level) pruning. Specifically, the `prune()` function in the DataMap interface consumes a filter and produces a list of Blocklets. It therefore seems that building a sophisticated data structure may not be useful. For example, for a spatial range query, the only thing I need to know is the boundary of each blocklet; any other insight is not exposed to the pruning procedure.
If you are developing your own DataMap for spatial data, you can override the pruning logic and write your own. During the write phase of your DataMap, your secondary index can store the spatial data along with its corresponding blocklet; later, during pruning, it filters the spatial data and returns the corresponding blocklet IDs. These blocklet IDs are then fed to BlockletDataMap (the default DataMap) to retrieve the detailed blocklet information.
> - It is common in most commercial databases that only one index is used for filtering, even when other secondary indexes could also prune. Most likely, the query optimizer chooses the index that provides the highest selectivity.
Yes, most commercial database optimizers choose the best access path (when multiple indexes are present) based on coverage and statistics. Indexes that completely cover the projection and predicates are preferred. This feature does not exist in CarbonData yet and would be good to have.
> - I am confused about the semantics of the `toDistribute()` function in the DataMap API. One problem I found in the MinMax DataMap example is that a single thread consumes all these indexes and then performs the pruning. As a result, we may lose any advantage of massive parallelism. Is the `distributed datamap` supposed to solve this problem?
The MinMax example applies the DataMap only on the driver side, not in the executors. It is provided just as an example. If you want your DataMap to be distributed and executed in the executors, it can be distributed.
> - Finally, could you give me an example of iterating through all rows in a blocklet, block, and segment so that I can get the input for index bulk loading? In one of the DataMap examples, which is still in a pull request (https://github.com/apache/carbondata/pull/1359), I cannot find this part, since it pulls the statistics directly from the MinMax index built into the Blocklet. (Please refer to the `loadBlockDetails` and `constructMinMaxIndex` functions in `MinMaxDataWriter.java` under that PR.)
I will be updating another pull request shortly that will scan data from the FactFile (i.e. the carbondata file) and update the DataMap secondary index. I will share the PR with you.

Thanks
Sounak
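
The access-path choice discussed in the second answer can be sketched as follows. This only illustrates picking the most selective index; per the reply above, CarbonData did not have this feature at the time, and `IndexChooser`, `CandidateIndex`, and the selectivity estimates are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of optimizer-style index selection; not a CarbonData API.
final class IndexChooser {

    static final class CandidateIndex {
        final String name;
        // Estimated fraction of rows that survive the filter; lower is more selective.
        final double estimatedMatchFraction;

        CandidateIndex(String name, double estimatedMatchFraction) {
            this.name = name;
            this.estimatedMatchFraction = estimatedMatchFraction;
        }
    }

    // Picks the single index expected to leave the fewest rows after pruning.
    static Optional<CandidateIndex> choose(List<CandidateIndex> candidates) {
        return candidates.stream()
            .min(Comparator.comparingDouble((CandidateIndex c) -> c.estimatedMatchFraction));
    }

    public static void main(String[] args) {
        List<CandidateIndex> candidates = Arrays.asList(
            new CandidateIndex("minmax", 0.40),
            new CandidateIndex("spatial", 0.02),
            new CandidateIndex("bloom", 0.10));
        System.out.println(choose(candidates).get().name); // prints spatial
    }
}
```

A real optimizer would also weigh whether an index covers the projection and predicates, as the reply notes; this sketch ranks by estimated selectivity alone.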


skyprophet

Oct 04, 2017; 2:43pm

Re: Questions and Concerns on DataMap API

Hi Sounak,

The biggest problem is not how to implement my own DataMap but the API design itself. According to the current API, the `prune` function returns a list of `Blocklet`s rather than tuples. Note that the concept of a Blocklet is closer to a partition in Spark. As a result, the `DataMap` API provides functionality closer to a global index (i.e., partition-level pruning) than to a local index (i.e., tuple-level pruning). `toDistribute` won't save us because it only turns the pruning procedure into a distributed one. Once these blocklets are fed into BlockletDataMap, it still follows a similar procedure and performs partition-level pruning. Of course, I can just build a global index over all the blocklets. However, spatial data is very sensitive to locality, so a global index on unsorted data has very limited power. A secondary index should be constructed as a local index that routes the pruning procedure to pointers to individual tuples.

Dong
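
The locality argument can be illustrated with a small experiment. The sketch below is an assumption-laden simplification: it uses one-dimensional values instead of spatial points, fixed-size blocklets, and a hypothetical `LocalityDemo` class. It counts how many blocklets survive min/max pruning when the same values are stored in arrival order versus sorted order.

```java
import java.util.Arrays;

// Hypothetical sketch; a 1-D stand-in for the spatial case discussed above.
final class LocalityDemo {

    // Counts fixed-size blocklets whose [min, max] range overlaps the query range [lo, hi].
    static int survivingBlocklets(double[] values, int blockletSize, double lo, double hi) {
        int survivors = 0;
        for (int start = 0; start < values.length; start += blockletSize) {
            int end = Math.min(start + blockletSize, values.length);
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            if (min <= hi && max >= lo) {
                survivors++;
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        double[] arrivalOrder = {1, 90, 5, 95, 50, 3, 60, 99};
        double[] sorted = arrivalOrder.clone();
        Arrays.sort(sorted);

        // Query range [40, 55]; only the value 50 actually matches.
        System.out.println(survivingBlocklets(arrivalOrder, 2, 40, 55)); // prints 3
        System.out.println(survivingBlocklets(sorted, 2, 40, 55));       // prints 1
    }
}
```

On unsorted data, three of the four blocklets must still be scanned to find a single matching value, while sorting reduces that to one. This is the sense in which a blocklet-level index has limited power without locality.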

sounak

Oct 05, 2017; 4:33am

Re: Questions and Concerns on DataMap API

Hi Dong,

If the requirement is a fine-grained DataMap, it is on the roadmap and coming soon; it will let you save the RowId, which will be helpful for your spatial data. In the meantime, you can start your implementation with blocklet-level (coarse-grained) pruning and later move to the fine-grained DataMap.

Thanks
Sounak
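
As a rough picture of what a fine-grained result could look like, the sketch below returns row ids grouped by blocklet instead of blocklet ids alone. The fine-grained DataMap was not yet available when this thread was written, so everything here is hypothetical: `FineGrainedSketch`, `IndexedPoint`, and the result shape are illustrative, and a linear scan stands in for a real spatial structure such as an R-tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tuple-level pruning; not the CarbonData fine-grained DataMap.
final class FineGrainedSketch {

    // One index entry per row: where the row lives and its coordinates.
    static final class IndexedPoint {
        final int blockletId;
        final int rowId;
        final double x, y;

        IndexedPoint(int blockletId, int rowId, double x, double y) {
            this.blockletId = blockletId;
            this.rowId = rowId;
            this.x = x;
            this.y = y;
        }
    }

    // Returns blockletId -> row ids of the points inside the query rectangle.
    static Map<Integer, List<Integer>> prune(List<IndexedPoint> index,
                                             double qMinX, double qMinY,
                                             double qMaxX, double qMaxY) {
        Map<Integer, List<Integer>> hits = new TreeMap<>();
        for (IndexedPoint p : index) {
            if (p.x >= qMinX && p.x <= qMaxX && p.y >= qMinY && p.y <= qMaxY) {
                hits.computeIfAbsent(p.blockletId, k -> new ArrayList<>()).add(p.rowId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<IndexedPoint> index = Arrays.asList(
            new IndexedPoint(0, 0, 1.0, 1.0),
            new IndexedPoint(0, 1, 9.0, 9.0),
            new IndexedPoint(1, 0, 8.5, 9.5),
            new IndexedPoint(1, 1, 50.0, 50.0),
            new IndexedPoint(1, 2, 9.0, 8.0));
        System.out.println(prune(index, 8, 8, 10, 10)); // prints {0=[1], 1=[0, 2]}
    }
}
```

A scan could then read only the listed rows within each blocklet, rather than every row of every surviving blocklet as in the coarse-grained case.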
