http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Introducing-V3-format-tp7609p7613.html
This will reduce the IO bottleneck. Page-level min/max values will improve
filter query performance, and separating decompression of data from the
reader layer will improve overall query performance.
> Please find the thrift file in below location.
>
> https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b1NqSTU2b2g4dkhkVDRj
>
> On 15 February 2017 at 17:14, Ravindra Pesala <
[hidden email]>
> wrote:
>
> > Problems in the current format:
> > 1. IO read is slow because it needs multiple seeks on the file to read
> > column blocklets. The current blocklet size is 120000 rows, so the reader
> > must read from the file multiple times to scan the data for a column.
> > Alternatively we could increase the blocklet size, but filter queries
> > then suffer because each filter has to process a bigger blocklet.
> > 2. Decompression is slow in the current format. We use an inverted index
> > for faster filter queries and NumberCompressor to bit-pack that inverted
> > index, which is slow, so we should avoid NumberCompressor. One
> > alternative is to keep the blocklet size within 32000 rows so that the
> > inverted index can be written as shorts, but then IO read suffers a lot.
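
[Editor's note] The short-encoding alternative mentioned above can be sketched as follows. This is a hypothetical illustration, not CarbonData's actual writer code: the class and method names are invented. The point is that once a page is capped at 32000 rows, every inverted-index entry fits in a signed short (max 32767), so the index can be stored as plain 2-byte values and read back sequentially with no bit-unpacking step.

```java
import java.nio.ByteBuffer;

// Illustrative codec (names are hypothetical): row positions within a page
// of at most 32000 rows always fit in a signed short.
public class InvertedIndexCodec {

    // Encode row positions as 2-byte shorts; valid only while every
    // rowId is below 32768 (guaranteed by the 32000-row page limit).
    public static byte[] encodeAsShorts(int[] invertedIndex) {
        ByteBuffer buf = ByteBuffer.allocate(invertedIndex.length * 2);
        for (int rowId : invertedIndex) {
            buf.putShort((short) rowId);
        }
        return buf.array();
    }

    // Decoding is a straight sequential read, no decompression involved.
    public static int[] decodeShorts(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int[] out = new int[bytes.length / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getShort();
        }
        return out;
    }
}
```

The trade-off the author describes is visible here: the encoding costs a fixed 2 bytes per row (slightly larger than bit-packing), but the decode loop does no arithmetic beyond a buffer read.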
> >
> > To overcome the above two issues we are introducing the new V3 format.
> > Here each blocklet contains multiple pages of 32000 rows each, and the
> > number of pages per blocklet is configurable. Since each page stays
> > within the short range, there is no need to compress the inverted index.
> > We also maintain min/max values for each page to further prune filter
> > queries.
> > The reader loads a blocklet with all its pages at once and keeps it in
> > offheap memory. During a filter, it first checks the min/max range, and
> > only if the page can match does it decompress the page to filter further.
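
[Editor's note] The page-pruning step described above can be sketched as below. This is a minimal illustration under assumed names (none of these classes are CarbonData APIs): each page carries its min/max, and the reader skips decompression entirely for pages whose range cannot satisfy the filter.

```java
// Hypothetical sketch of page-level min/max pruning. A real reader would
// hold per-page statistics from the blocklet footer; here they are passed in.
public class PagePruner {

    // Min/max statistics stored for one 32000-row page.
    public static final class PageStats {
        final long min;
        final long max;
        public PageStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    // True if a filter "column == value" can possibly hit this page.
    // When false, the page is skipped without being decompressed.
    public static boolean mightContain(PageStats stats, long value) {
        return value >= stats.min && value <= stats.max;
    }
}
```

With the blocklet read in one IO and kept offheap, this check turns filtering into a cheap metadata comparison per page, and decompression cost is paid only for pages that can actually match.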
> >
> > Please find the attached V3 format thrift file.
> >
> > --
> > Thanks & Regards,
> > Ravi
> >