Re: [DISCUSSION] Page Level Bloom Filter

Posted by kumarvishal09 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Page-Level-Bloom-Filter-tp85720p87427.html

Hi Manhua,

I agree with Ravindra and Vimal that adding a page-level bloom will not
improve query performance much, because it will not reduce the amount of
data read from the disk. It will only save some processing time
(decompression of pages and applying the filter on those pages). Keeping
the bloom information in the file footer will help in reducing both IO
and processing.
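
To make that trade-off concrete, here is a minimal, self-contained sketch
of where each variant saves work. It uses Guava's BloomFilter purely for
illustration, and the Blocklet class is a hypothetical stand-in, not
CarbonData's actual reader API:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class FooterBloomPruning {

  // Hypothetical stand-in for a blocklet: a footer-level bloom plus rows.
  static class Blocklet {
    final BloomFilter<CharSequence> footerBloom;
    final List<String> rows; // pretend this is compressed data on disk

    Blocklet(List<String> rows) {
      this.rows = rows;
      this.footerBloom = BloomFilter.create(
          Funnels.stringFunnel(StandardCharsets.UTF_8), rows.size(), 0.01);
      rows.forEach(this.footerBloom::put);
    }
  }

  public static void main(String[] args) {
    List<Blocklet> blocklets = List.of(
        new Blocklet(List.of("a", "b", "c")),
        new Blocklet(List.of("x", "y", "z")));
    String filterValue = "y";
    for (Blocklet b : blocklets) {
      // Footer-level check: runs BEFORE the blocklet is read and
      // decompressed, so a negative answer saves both IO and processing.
      if (!b.footerBloom.mightContain(filterValue)) {
        continue; // pruned without touching the data
      }
      // A page-level bloom could only help from this point on: the
      // blocklet is already in memory, so only filter work is saved.
      b.rows.stream().filter(filterValue::equals)
          .forEach(r -> System.out.println("matched: " + r));
    }
  }
}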

I agree with you that finding the distinct count for a column within a
blocklet will be complex, because a blocklet is cut based on size, not on
the number of rows. Pages also have a size-based configuration, which is
disabled by default now, but we are planning to enable it by default to
support huge binary/varchar/complex data.

Sol1. We can ask the user to pass the cardinality of the column for which
they want to generate the bloom.
Sol2. Once the blocklet cut is done while writing the carbondata file, we
can calculate the cardinality of the column for which the bloom is to be
generated. If the resulting size is too large, we can drop the bloom for
that blocklet.
A rough sketch of both options follows.
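
The sketch below assumes Guava's BloomFilter (for illustration only) and
a hypothetical MAX_BLOOM_BYTES budget; the method names are made up:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BlockletBloomWriter {

  // Assumed per-blocklet size budget for the bloom; tune as needed.
  static final long MAX_BLOOM_BYTES = 64 * 1024;

  // Sol1: the user supplies the expected cardinality up front.
  static BloomFilter<CharSequence> bloomFromUserHint(int userCardinality) {
    return BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), userCardinality, 0.03);
  }

  // Sol2: after the blocklet cut, count distinct values from the buffered
  // rows, size the bloom exactly, and drop it if it would be too large.
  static BloomFilter<CharSequence> bloomFromBlocklet(List<String> columnValues) {
    Set<String> distinct = new HashSet<>(columnValues);
    double fpp = 0.03;
    // Standard bloom sizing: bits = -n * ln(p) / (ln 2)^2
    long bits = (long) Math.ceil(
        -distinct.size() * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    if (bits / 8 > MAX_BLOOM_BYTES) {
      return null; // drop the bloom for this blocklet
    }
    BloomFilter<CharSequence> bloom = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), distinct.size(), fpp);
    distinct.forEach(bloom::put);
    return bloom;
  }
}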

I agree with Ravi that keeping a different FPP (false positive
probability) for the executor and the driver will help in reducing the
size.
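
For intuition on why this helps, the standard sizing formula
m = -n * ln(p) / (ln 2)^2 shows how quickly the bits grow as the FPP
tightens. The numbers below are purely illustrative:

public class BloomSizing {
  // Optimal bloom size in bits for n elements at false positive rate p:
  //   m = -n * ln(p) / (ln 2)^2
  static long bloomBits(long n, double fpp) {
    return (long) Math.ceil(-n * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  public static void main(String[] args) {
    long n = 1_000_000; // assumed distinct values in one block
    // A looser FPP at the driver keeps the in-memory index small ...
    System.out.printf("driver   (fpp=0.10): %d KB%n",
        bloomBits(n, 0.10) / 8 / 1024);
    // ... while a tighter FPP at the executor buys better pruning.
    System.out.printf("executor (fpp=0.01): %d KB%n",
        bloomBits(n, 0.01) / 8 / 1024);
  }
}

With one million distinct values, fpp=0.10 needs roughly 585 KB while
fpp=0.01 needs roughly 1170 KB, so the driver-side bloom can be about
half the size of the executor-side one.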

-Regards
Kumar Vishal

On Tue, Nov 26, 2019 at 2:22 PM Manhua <[hidden email]> wrote:

> Hi Vimal,
>    Regarding your concern: if you have tried the bloom datamap, you may
> know how difficult it is to configure the bloom parameters. You never
> know how many (distinct) elements will be added to the bloom filter,
> because a blocklet is configured by size. The more bytes a row takes,
> the fewer rows are added to a blocklet. And at block level this is
> related to the block size configuration too. Also, please mind the size
> of the bloom filter.
>
>
> On 2019/11/26 08:24:33, Vimal Das Kammath <[hidden email]> wrote:
> > I agree with ravindra that having a bloom filter at page level would
> > not save any IO. Having a bloom filter at file level makes sense, as
> > it could help to prune files at the driver side. But I am concerned
> > about the number of false positives that would result if we keep the
> > bloom filter at the entire-file level. I think we need to experiment
> > to find the ideal parameters (bloom size and number of hash
> > functions) that would work effectively for a file-level bloom filter.
> >
> > Regards,
> > Vimal
> >
> > On Tue, Nov 26, 2019 at 12:30 PM ravipesala <[hidden email]> wrote:
> >
> > > Hi Manhua,
> > >
> > > The main problem with this approach is that we cannot save any IO,
> > > as our IO unit is the blocklet, not the page. Once the data is
> > > already in memory, I really don't think we can gain performance
> > > with a bloom at page level. I feel the solution would be efficient
> > > only if the IO is saved somewhere.
> > >
> > > Our min/max index is efficient because it can prune the files at
> > > the driver side and prune the blocklets and pages at the executor
> > > side. It is actually saving lots of IO.
> > >
> > > Supporting bloom at the carbondata file and index level is a good
> > > approach, rather than just supporting it at page level. My
> > > intention is that it should behave just the same as the min/max
> > > index, so that we can prune the data at multiple levels.
> > >
> > > At the driver side, at block level, we can have a bloom with a
> > > lower probability percentage and fewer hash functions to control
> > > the size, as we load it into memory. At blocklet level we can
> > > increase the probability and hashes a little more for better
> > > pruning, and at page level we can increase the probability further
> > > to get a much better pruning ability.
> > >
> > >
> > > Regards,
> > > Ravindra.
> > >
> > >
> > >
> > > --
> > > Sent from:
> > > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
>