http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Page-Level-Bloom-Filter-tp85720p87417.html
Yes, you're right. I ran a test this morning with nmon and found that IO did not decrease, but CPU usage did. The time gap between runs with and without the page-level bloom likely comes from decompressing and decoding the column pages, and from the number of rows Spark needs to process.
For blooms at different levels, we can `OR` the bitmaps of the pages to get the blocklet-level bitmap, and similarly combine blocklet bitmaps to get the block level.
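For illustration only (the class name and the plain `BitSet` representation are my own assumptions, not CarbonData code), a minimal Java sketch of that roll-up could look like this. Note it is only valid when every page bloom uses the same bit-array size and hash functions:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Rolling finer-level bloom bitmaps up into coarser levels. Valid only when
// all filters share the same bit-array size and hash functions: then the OR
// of two bitmaps is exactly the bloom filter of the union of their values.
public final class BloomRollup {

  /** ORs a list of bloom bitmaps into one combined bitmap. */
  static BitSet or(List<BitSet> bitmaps) {
    BitSet combined = new BitSet();
    for (BitSet b : bitmaps) {
      combined.or(b);
    }
    return combined;
  }

  public static void main(String[] args) {
    BitSet page1 = new BitSet(64);
    page1.set(3); page1.set(17);          // bits set by one page's values
    BitSet page2 = new BitSet(64);
    page2.set(17); page2.set(42);

    BitSet blockletBloom = or(Arrays.asList(page1, page2));
    System.out.println(blockletBloom);    // prints {3, 17, 42}
    // The block-level bloom would be or(<all blocklet bitmaps>) the same way.
  }
}
```

One caveat: the OR-ed bitmap covers more distinct values, so its effective false-positive rate rises at coarser levels unless the bit array is sized for the larger set.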
> Hi Manhua,
>
> The main problem with this approach is that we cannot save any IO, as our
> IO unit is the blocklet, not the page. Once the data is already in memory,
> I really don't think we can gain performance with a bloom at the page
> level. I feel the solution would be efficient only if IO is saved
> somewhere.
>
> Our min/max index is efficient because it can prune the files at the
> driver side and prune the blocklets and pages at the executor side. It
> actually saves a lot of IO.
>
> Supporting bloom at the CarbonData file and index levels is a better
> approach than supporting it only at the page level. My intention is that
> it should behave just the same as the min/max index, so that we can prune
> the data at multiple levels.
>
> At the driver side, at the block level, we can have a bloom with a lower
> probability percentage and fewer hash functions to control the size, since
> we load it into memory. At the blocklet level we can increase the
> probability and the number of hashes a little more for better pruning, and
> at the page level we can increase the probability further to get much
> better pruning ability.
>
>
> Regards,
> Ravindra.
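Reading Ravindra's suggestion as a looser, smaller filter at the block level (since the driver holds it in memory) and progressively tighter filters at the blocklet and page levels, a sketch using the standard bloom-filter sizing formulas might look like the following. The row counts and false-positive probabilities are illustrative assumptions, not CarbonData defaults:

```java
// Multi-level bloom sizing: looser (higher false-positive probability,
// fewer hashes) at the block level to keep the driver-resident index small,
// tighter at blocklet and page level for better pruning. Uses the standard
// formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
public final class BloomSizing {

  /** Optimal number of bits for n elements at false-positive probability p. */
  static long optimalBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  /** Optimal number of hash functions for m bits and n elements. */
  static int optimalHashes(long m, long n) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

  static void print(String level, long n, double p) {
    long m = optimalBits(n, p);
    int k = optimalHashes(m, n);
    System.out.printf("%s: n=%d fpp=%.2f bits=%d hashes=%d%n", level, n, p, m, k);
  }

  public static void main(String[] args) {
    // Illustrative row counts per unit; real values depend on table config.
    print("block   ", 3_200_000L, 0.10);  // loose: small, driver-resident
    print("blocklet",   128_000L, 0.03);
    print("page    ",    32_000L, 0.01);  // tight: best pruning, read lazily
  }
}
```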