Apache CarbonData Dev Mailing List archive

Re: Introducing V3 format.

Posted by ravipesala on Feb 15, 2017; 11:50am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Introducing-V3-format-tp7609p7610.html

Please find the thrift file at the location below.
https://drive.google.com/open?id=0B4TWTVbFSTnqZEdDRHRncVItQ242b1NqSTU2b2g4dkhkVDRj

On 15 February 2017 at 17:14, Ravindra Pesala <[hidden email]> wrote:

> Problems in the current format:
> 1. IO reads are slow because reading column blocklets requires multiple
> seeks on the file. The current blocklet size is 120000, so scanning the
> data for a column takes multiple reads from the file. Alternatively, we
> could increase the blocklet size, but filter queries would suffer because
> they would have to filter larger blocklets.
> 2. Decompression is slow in the current format. We use an inverted index
> for faster filter queries, and we use NumberCompressor to compress the
> inverted index with bit-wise packing. This makes decompression slower, so
> we should avoid NumberCompressor. One alternative is to keep the blocklet
> size within 32000 so that the inverted index can be written as shorts,
> but then IO reads suffer a lot.
>
> To overcome the above two issues, we are introducing a new format, V3.
> Here each blocklet has multiple pages of size 32000, and the number of
> pages in a blocklet is configurable. Since we keep the page size within
> the short limit, there is no need to compress the inverted index.
> We also maintain the max/min for each page to further prune filter queries.
> The blocklet is read with all its pages at once and kept in offheap memory.
> During filtering, we first check the max/min range, and only if the filter
> value can fall within it do we decompress the page to filter further.
>
> Please find the attached V3 format thrift file.
>
> --
> Thanks & Regards,
> Ravi
>
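
For readers who want a concrete picture of the page layout described in the quoted mail, here is a rough Python sketch of page-level min/max pruning. It is only an illustration under assumptions, not the CarbonData implementation: the names Page, build_blocklet, filter_equals and pack_inverted_index are hypothetical, the column is assumed to hold plain integers, the filter is assumed to be an equality filter, and compression and offheap storage are not modelled.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple

PAGE_SIZE = 32000  # rows per page, as stated in the mail (below the 16-bit limit of 32767)


@dataclass
class Page:
    values: List[int]  # stands in for the (normally compressed) page data
    min_value: int     # per-page min kept alongside the page
    max_value: int     # per-page max kept alongside the page


def build_blocklet(column: List[int], page_size: int = PAGE_SIZE) -> List[Page]:
    """Split one column of a blocklet into pages and record min/max per page."""
    pages = []
    for start in range(0, len(column), page_size):
        chunk = column[start:start + page_size]
        pages.append(Page(chunk, min(chunk), max(chunk)))
    return pages


def filter_equals(pages: List[Page], target: int) -> Tuple[List[int], int]:
    """Return (matching blocklet row ids, number of pages actually scanned)."""
    matches = []
    scanned = 0
    offset = 0  # row id of the first row in the current page
    for page in pages:
        # Check the min/max range first; skip the page if target cannot be in it.
        if page.min_value <= target <= page.max_value:
            scanned += 1  # only here would a real reader decompress the page
            for local_id, value in enumerate(page.values):
                if value == target:
                    matches.append(offset + local_id)
        offset += len(page.values)
    return matches, scanned


def pack_inverted_index(local_row_ids: List[int]) -> bytes:
    """Write page-local row ids as signed 16-bit shorts (2 bytes each)."""
    return struct.pack(f"<{len(local_row_ids)}h", *local_row_ids)


if __name__ == "__main__":
    pages = build_blocklet(list(range(100000)))
    rows, scanned = filter_equals(pages, 70000)
    print(len(pages), rows, scanned)
    print(len(pack_inverted_index(list(range(PAGE_SIZE)))))
```

With a sorted column of 100000 values, the blocklet splits into four pages, and an equality filter only has to scan the single page whose min/max range contains the value. Because row ids inside a page stay below 32768, each inverted index entry fits in two bytes without any bit packing.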

--
Thanks & Regards,
Ravi