Apache CarbonData Dev Mailing List archive - Re: [Discussion] Please vote and comment for carbon data file format change

Posted by kumarvishal09 on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Please-vote-and-comment-for-carbon-data-file-format-change-tp2491p3361.html

Hi Jihong Ma,

Please find the attachment.

-Regards

Kumar Vishal

On Fri, Nov 4, 2016 at 12:16 AM, Jihong Ma <[hidden email]> wrote:

Hi Kumar,

Please put the proposed format changes in an attachment, or attach them to the associated JIRA. I would like to take a look.

Thanks!

Jihong

-----Original Message-----
From: Jacky Li [mailto:[hidden email]]
Sent: Thursday, November 03, 2016 7:54 AM
To: [hidden email]
Subject: Re: [Discussion] Please vote and comment for carbon data file format change

The proposed change is reasonable, +1.
But is there a plan to make the reader backward compatible with the old format, so that the impact on current deployments is minimal?

Regards,
Jacky

> On Nov 2, 2016, at 12:38 AM, Kumar Vishal <[hidden email]> wrote:
>
> Hi Xiaoqiao He,
>
> Please find the attachment.
>
> -Regards
> Kumar Vishal
>
> On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He <[hidden email]> wrote:
> Hi Kumar Vishal,
>
> I couldn't see the figures of the file format. Could you re-upload them?
> Thanks.
>
> Best Regards
>
> On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal <[hidden email]> wrote:
>
> >
> > Hello All,
> >
> > Improving Carbon first-time query performance
> >
> > Reasons:
> > 1. When the file system cache has been cleared, files must be read from
> > disk and cached again, which makes the first read slower.
> > 2. On a first-time query, Carbon has to read the footer from the data
> > file to form the btree.
> > 3. Carbon reads more footer data than is required (the data chunk).
> > 4. Many random seeks happen in Carbon because column data (data page,
> > rle, inverted index) is not stored together.
> >
> > Solution:
> > 1. Improve block loading time. This can be done by removing the data chunk
> > from blockletInfo and storing only the offset and length of the data chunk.
> > 2. Compress the presence meta bitset, which is stored for null values of
> > measure columns, using snappy.
> > 3. Store the metadata and data of a column together and read them together.
> > This reduces random seeks and improves IO.
> >
> > For this I am planning to change the carbondata thrift format.
> >
> > *Old format*
> >
> >
> >
> > *New format*
> >
> >
> >
> >
> > Please vote and comment on this new format change.
> >
> > -Regards
> > Kumar Vishal
> >
> >
> >
> >
>
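
Solution items 1 and 3 in the proposal quoted above (keep only the offset and length of each data chunk in the footer, and store a column's metadata next to its data) can be illustrated with a minimal sketch. This is not CarbonData code and not the proposed thrift format: ChunkRef, write_column_chunk, read_column_chunk, and the [meta_len][meta][page] byte layout are all hypothetical names and choices made up for illustration, and an in-memory buffer stands in for a data file.

```python
import io
import struct
from dataclasses import dataclass


@dataclass
class ChunkRef:
    """Hypothetical footer entry: only where a column chunk lives, not its contents."""
    offset: int
    length: int


def write_column_chunk(buf, meta: bytes, page: bytes) -> ChunkRef:
    """Write [meta_len][meta][page] contiguously and return its location.

    Assumed layout: a 4-byte little-endian length, the column metadata
    (e.g. rle / inverted index info), then the data page.
    """
    offset = buf.tell()
    buf.write(struct.pack("<I", len(meta)))
    buf.write(meta)
    buf.write(page)
    return ChunkRef(offset, buf.tell() - offset)


def read_column_chunk(buf, ref: ChunkRef):
    """Fetch metadata and data page with a single seek and a single read."""
    buf.seek(ref.offset)
    raw = buf.read(ref.length)
    (meta_len,) = struct.unpack_from("<I", raw, 0)
    meta = raw[4:4 + meta_len]
    page = raw[4 + meta_len:]
    return meta, page


# In-memory stand-in for a data file with two column chunks.
buf = io.BytesIO()
footer = [
    write_column_chunk(buf, b"rle+inverted-index:col0", b"\x01\x02\x03"),
    write_column_chunk(buf, b"rle+inverted-index:col1", b"\x04\x05"),
]

# The footer holds two small (offset, length) pairs; chunk contents are
# read only when a query actually needs that column.
meta, page = read_column_chunk(buf, footer[1])
```

In this sketch the footer stays small regardless of how large the chunk metadata is, and reading one column costs one seek instead of separate seeks for the data page, rle, and inverted index. Whether the real format benefits to the same degree depends on details that the missing figures would have shown.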

kumar vishal