Apache CarbonData Dev Mailing List archive - Re: [Discussion] Please vote and comment for carbon data file format change

Posted by bill.zhou on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Please-vote-and-comment-for-carbon-data-file-format-change-tp2491p4049.html

+1, this modification will help in all scenarios.

Kumar Vishal wrote

Hello All,

Improving Carbon first-time query performance

Reason:
1. When the file system cache has been cleared, data must be read from disk
and cached again, which makes reads slower.
2. On a first-time query, Carbon has to read the footer from the data file
to build the btree.
3. Carbon reads more footer data than is required (the data chunk).
4. Carbon performs many random seeks because a column's data (data page,
rle, inverted index) is not stored together.
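
To make reason 4 concrete, here is a toy model that counts seeks for two file layouts. The part names come from the list above; the two layouts, the helper names, and the "one seek per contiguous run" rule are illustrative assumptions, not CarbonData's actual on-disk format or reader.

```python
# Toy model: count the seeks needed to read every part of the requested columns.
# PARTS and both layouts are assumptions for illustration only.
PARTS = ("data_page", "rle", "inverted_index")


def scattered_layout(columns):
    """All data pages first, then all rle blocks, then all inverted indexes."""
    return [(col, part) for part in PARTS for col in columns]


def colocated_layout(columns):
    """All parts of one column are stored next to each other."""
    return [(col, part) for col in columns for part in PARTS]


def count_seeks(layout, wanted_columns):
    """Count one seek per contiguous run of wanted entries in file order."""
    seeks = 0
    previous_wanted = False
    for col, _part in layout:
        wanted = col in wanted_columns
        if wanted and not previous_wanted:
            seeks += 1
        previous_wanted = wanted
    return seeks


columns = ["c1", "c2", "c3", "c4"]
scattered_seeks = count_seeks(scattered_layout(columns), {"c2"})
colocated_seeks = count_seeks(colocated_layout(columns), {"c2"})
print(scattered_seeks, colocated_seeks)
```

In this model, reading a single column costs one seek per part when the parts are scattered, and a single seek when they are stored together.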

Solution:
1. Improve block loading time by removing the data chunk from blockletInfo
and storing only the offset and length of the data chunk.
2. Compress the presence meta bitset, which is stored for null values of
measure columns, using snappy.
3. Store the metadata and data of a column together and read them together.
This reduces random seeks and improves IO.
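
A minimal sketch of points 1 and 3, assuming a made-up binary layout: each column chunk (metadata and data together) is written as one contiguous block, and the footer keeps only an (offset, length) pair per chunk. The entry encoding, the trailing entry count, and all function names are hypothetical; the real format is defined in thrift and is not reproduced here.

```python
import io
import struct

# Hypothetical footer entry: 8-byte offset followed by 4-byte length.
ENTRY = struct.Struct("<QI")


def write_file(chunks):
    """Write chunks back to back, then the footer entries, then the entry count."""
    buf = io.BytesIO()
    entries = []
    for chunk in chunks:
        entries.append((buf.tell(), len(chunk)))
        buf.write(chunk)
    for offset, length in entries:
        buf.write(ENTRY.pack(offset, length))
    buf.write(struct.pack("<I", len(entries)))
    return buf.getvalue()


def read_footer(f):
    """Read only the small footer: the entry count, then the (offset, length) pairs."""
    f.seek(-4, io.SEEK_END)
    (count,) = struct.unpack("<I", f.read(4))
    f.seek(-(4 + count * ENTRY.size), io.SEEK_END)
    return [ENTRY.unpack(f.read(ENTRY.size)) for _ in range(count)]


def read_chunk(f, entry):
    """Fetch one column chunk on demand with a single seek and read."""
    offset, length = entry
    f.seek(offset)
    return f.read(length)


chunks = [b"col-a-metadata+data", b"col-b-metadata+data", b"col-c"]
f = io.BytesIO(write_file(chunks))
footer = read_footer(f)
second = read_chunk(f, footer[1])
print(footer, second)
```

With this shape the footer size depends only on the number of chunks, so loading a block does not pull in chunk contents that the query may never touch.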

For this I am planning to change the carbondata thrift format.

*Old format*

*New format*

Please vote and comment on this new format change.

-Regards
Kumar Vishal