Re: [DISCUSSION] Page Level Bloom Filter

Posted by Manhua-2 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-Page-Level-Bloom-Filter-tp85720p87428.html

Hi Vishal,
   I want to ask a question: when supporting huge binary/varchar/complex data, will the number of rows in a page be larger or smaller than 32000?
  Thanks.

On 2019/11/26 11:49:54, Kumar Vishal <[hidden email]> wrote:

> Hi Manhua,
>
> I agree with Ravindra and Vimal that adding a page-level bloom will not
> improve query performance much, because it will not reduce the amount of
> data read from the disk.
> It will only reduce some processing time (uncompressing pages and
> applying filters on those pages).
> Keeping the bloom information in the file footer will help reduce both
> IO and processing.
>
> I agree with you that finding the distinct count for a column in a
> blocklet will be complex, because a blocklet is based on size, not on
> rows. Pages also have a size-based configuration, which is false by
> default now, but we are planning to make it true to support huge
> binary/varchar/complex data.
>
> Sol1. We can ask the user to pass the cardinality of the column for
> which he wants to generate the bloom.
> Sol2. Once the blocklet cut is done while writing the carbondata file,
> we can calculate the cardinality of the column for which he wants to
> generate the bloom. If the size is too large, we can drop the bloom for
> that blocklet.
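[Editor's note: Sol2 could be sketched roughly as below. The function names, the 3% FPP, and the 64 KB budget are illustrative assumptions, not CarbonData's actual writer path; only the bloom sizing formula itself is standard.]

```python
import math

def bloom_size_bits(n_distinct, fpp):
    # Standard bloom sizing: m = -n * ln(p) / (ln 2)^2
    return math.ceil(-n_distinct * math.log(fpp) / (math.log(2) ** 2))

def build_blocklet_bloom(column_values, fpp=0.03, max_bytes=64 * 1024):
    """Sol2 sketch: once the blocklet is cut, measure the real
    cardinality; if the resulting filter would exceed the size budget,
    drop the bloom for this blocklet (return None)."""
    n_distinct = len(set(column_values))
    size_bytes = bloom_size_bits(n_distinct, fpp) // 8
    if size_bytes > max_bytes:
        return None  # too large: skip the bloom for this blocklet
    return {"n": n_distinct, "fpp": fpp, "bytes": size_bytes}
```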
>
> I agree with Ravi that keeping different FPPs for the executor and the
> driver will help reduce the size.
>
> -Regards
> Kumar Vishal
>
> On Tue, Nov 26, 2019 at 2:22 PM Manhua <[hidden email]> wrote:
>
> > Hi Vimal,
> >    For what you are concerned about: if you have tried the bloom
> > datamap, you may know how difficult it is to configure the bloom
> > parameters. You never know how many (distinct) elements will be added
> > to the bloom filter, because a blocklet is configured by size. The
> > more bytes a row takes, the fewer rows fit in a blocklet. And at the
> > block level, this will be related to the block size configuration too.
> > Also, please mind the size of the bloom filter.
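[Editor's note: the size-based-cut point can be made concrete with a back-of-the-envelope calculation. The 64 MB blocklet size and the row widths below are purely illustrative, not defaults from any configuration.]

```python
def approx_rows_per_blocklet(blocklet_size_bytes, avg_row_bytes):
    """Size-based cut: wider rows mean fewer rows per blocklet, so the
    number of (distinct) elements the bloom must hold is unknown up
    front and varies per blocklet."""
    return blocklet_size_bytes // avg_row_bytes

blocklet = 64 * 1024 * 1024          # 64 MB, illustrative
narrow = approx_rows_per_blocklet(blocklet, 100)      # narrow rows
wide = approx_rows_per_blocklet(blocklet, 100_000)    # huge binary rows
# The two blocklets differ by three orders of magnitude in row count,
# which is why one fixed bloom parameter cannot fit both.
```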
> >
> >
> > On 2019/11/26 08:24:33, Vimal Das Kammath <[hidden email]>
> > wrote:
> > > I agree with Ravindra that having a bloom filter at the page level
> > > would not save any IO. Having a bloom filter at the file level makes
> > > sense, as it could help prune files on the driver side. But I am
> > > concerned about the number of false positives that would result if
> > > we keep a bloom filter at the entire-file level. I think we need to
> > > experiment to find the ideal parameters (bloom size and number of
> > > hash functions) that would work effectively for a file-level bloom
> > > filter.
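[Editor's note: the two parameters Vimal mentions fall out of the standard bloom-filter formulas, so a first pass of that experiment can be done on paper. The 10M-key, 1% FPP inputs below are assumptions to plug in, not measured values.]

```python
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n elements at
    false-positive rate p:
    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# e.g. one file holding ~10M distinct keys at 1% FPP:
m, k = bloom_params(10_000_000, 0.01)
# m is roughly 9.6e7 bits (about 11.4 MiB) with k = 7 hash functions,
# which shows why a whole-file bloom gets heavy at low FPP.
```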
> > >
> > > Regards,
> > > Vimal
> > >
> > > On Tue, Nov 26, 2019 at 12:30 PM ravipesala <[hidden email]> wrote:
> > >
> > > > Hi Manhua,
> > > >
> > > > The main problem with this approach is that we cannot save any
> > > > IO, as our IO unit is the blocklet, not the page. Once it is
> > > > already in memory, I really don't think we can get performance
> > > > from a bloom at the page level. I feel the solution would be
> > > > efficient only if the IO is saved somewhere.
> > > >
> > > > Our min/max index is efficient because it can prune files at the
> > > > driver side and prune blocklets and pages at the executor side.
> > > > It actually saves a lot of IO.
> > > >
> > > > Supporting bloom at the carbondata file and index level is a
> > > > better approach than supporting it only at the page level. My
> > > > intention is that it should behave just like the min/max index,
> > > > so that we can prune the data at multiple levels.
> > > >
> > > > At the driver side, at the block level, we can have a bloom with
> > > > a lower probability and fewer hash functions to control the size,
> > > > as we load it into memory. At the blocklet level we can increase
> > > > the probability and the hashes a little more for better pruning,
> > > > and at the page level we can increase the probability further to
> > > > get a much better pruning ability.
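[Editor's note: the three-level scheme Ravindra describes can be sketched as a pruning cascade. Plain sets stand in for real bloom filters here, and the per-level membership is illustrative; the point is only that the first level to rule a key out stops any deeper read.]

```python
def prune(levels, key):
    """Walk block -> blocklet -> page; stop at the first level whose
    filter rules the key out, so coarse cheap filters run first."""
    for name, members in levels:
        if key not in members:
            return name  # pruned here, no deeper read needed
    return None  # survived all levels: actually read the page

levels = [
    ("block",    {1, 2, 3, 4, 5}),  # driver side: high FPP, small
    ("blocklet", {1, 2, 3}),        # executor side: tighter
    ("page",     {1, 2}),           # tightest, best pruning
]
```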
> > > >
> > > >
> > > > Regards,
> > > > Ravindra.
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from:
> > > > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > > >
> > >
> >
>