[Discussion] Please vote and comment for carbon data file format change


kumarvishal09

Hello All,

Improving Carbon first-time query performance

Reason:
1. When the file system cache has been cleared, file reads are slower because the data must be read from disk and cached again.
2. On a first-time query, Carbon has to read the footer from the data file to build the B-tree.
3. Carbon reads more footer data than it requires (the data chunk).
4. Many random seeks happen in Carbon because a column's data (data page, RLE, inverted index) is not stored together.
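
To illustrate reason 4, here is a minimal Python sketch that counts the seeks needed to read every part of one column under two layouts. The layout model, the helper names `count_seeks` and `build_layout`, and the fixed 100-byte part size are assumptions made for illustration only; the thread does not show the actual on-disk order of the old format.

```python
# Hypothetical layout model: each entry is (column, part, offset, length).
def count_seeks(layout, column):
    """Count the non-contiguous jumps needed to read all parts of one column."""
    parts = sorted((off, length) for col, _part, off, length in layout if col == column)
    seeks = 0
    position = None  # end of the previous read; None before the first read
    for off, length in parts:
        if off != position:
            seeks += 1  # the next part does not start where the last one ended
        position = off + length
    return seeks


def build_layout(order, size=100):
    """Assign consecutive offsets to (column, part) pairs in the given order."""
    return [(col, part, i * size, size) for i, (col, part) in enumerate(order)]


COLUMNS = ["c1", "c2"]
PARTS = ["data_page", "rle", "inverted_index"]

# Assumed "scattered" order: parts grouped by kind across columns.
scattered = build_layout([(c, p) for p in PARTS for c in COLUMNS])
# "Co-located" order: all parts of c1, then all parts of c2.
colocated = build_layout([(c, p) for c in COLUMNS for p in PARTS])

scattered_seeks = count_seeks(scattered, "c1")
colocated_seeks = count_seeks(colocated, "c1")
```

In this toy model, reading column `c1` needs one seek per part when the parts are interleaved with another column, and a single seek when they sit next to each other.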

Solution:
1. Improve block loading time by removing the data chunk from blockletInfo and storing only the offset and length of the data chunk.
2. Compress the presence-meta bitset stored for null values of measure columns using Snappy.
3. Store the metadata and data of a column together and read them together; this reduces random seeks and improves I/O.
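
Solution 1 can be sketched roughly as follows: the footer keeps only an (offset, length) pair per data chunk, so loading the footer stays cheap and a chunk body is fetched only when it is needed. This is not the CarbonData Thrift definition; the byte layout and the names `write_blocklet`, `read_footer`, and `read_chunk` are hypothetical and chosen only to show the idea.

```python
import io
import struct

ENTRY = "<QI"  # hypothetical footer entry: 8-byte offset, 4-byte length


def write_blocklet(chunks):
    """Write chunks back to back, then a footer of (offset, length) entries.

    Assumed layout: [chunk bytes...][footer entries][entry count: 4 bytes].
    """
    buf = io.BytesIO()
    index = []
    for chunk in chunks:
        index.append((buf.tell(), len(chunk)))
        buf.write(chunk)
    for offset, length in index:
        buf.write(struct.pack(ENTRY, offset, length))
    buf.write(struct.pack("<I", len(index)))
    return buf.getvalue()


def read_footer(data):
    """Read only the footer: a list of (offset, length), no chunk bodies."""
    (count,) = struct.unpack_from("<I", data, len(data) - 4)
    entry_size = struct.calcsize(ENTRY)
    start = len(data) - 4 - count * entry_size
    return [struct.unpack_from(ENTRY, data, start + i * entry_size) for i in range(count)]


def read_chunk(data, entry):
    """Fetch one chunk body on demand using its footer entry."""
    offset, length = entry
    return data[offset:offset + length]


file_bytes = write_blocklet([b"column-a-chunk", b"column-b-chunk-longer"])
footer = read_footer(file_bytes)
second_chunk = read_chunk(file_bytes, footer[1])
```

The footer size here grows with the number of chunks, not with their contents, which is the property the proposal relies on to speed up block loading.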

For this, I am planning to change the CarbonData Thrift format.

Old format: [figure not preserved]

New format: [figure not preserved]

Please vote and comment on this new format change.

-Regards
Kumar Vishal



Re: [Discussion] Please vote and comment for carbon data file format change

hexiaoqiao
Hi Kumar Vishal,

I couldn't see the figures of the file format. Could you re-upload them?
Thanks.

Best Regards

On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal <[hidden email]>
wrote:

> [...]

Re: [Discussion] Please vote and comment for carbon data file format change

kumarvishal09
Hi Xiaoqiao He,

Please find the attachment.

-Regards
Kumar Vishal

On Tue, Nov 1, 2016 at 9:27 PM, Xiaoqiao He <[hidden email]> wrote:
> [...]


Re: [Discussion] Please vote and comment for carbon data file format change

Jacky Li
The proposed change is reasonable, +1.
But is there a plan to make the reader backward compatible with the old format, so that the impact on current deployments is minimal?

Regards,
Jacky

> On Nov 2, 2016, at 12:38 AM, Kumar Vishal <[hidden email]> wrote:
> [...]

Re: [Discussion] Please vote and comment for carbon data file format change

kumarvishal09
Dear Jacky,

Yes, I am planning to support both data formats (new and old) in the reader and in the writer. The new writer will be enabled by default, but if a user wants to write in the older format, I will expose a configuration option for that.
Please let me know if you have any other suggestions.
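
A minimal sketch of what such dual-format support could look like, assuming each file carries a version tag that the reader dispatches on, and a writer option that defaults to the new format. The version constants, `WriterConfig`, and the reader and writer functions are hypothetical; the thread does not name the actual configuration property or the version encoding.

```python
from dataclasses import dataclass

# Hypothetical version tags; the thread does not name the real ones.
OLD_FORMAT = 1
NEW_FORMAT = 2


@dataclass
class WriterConfig:
    # Hypothetical option: the new format is the default, the old one is opt-in.
    format_version: int = NEW_FORMAT


def read_old(payload):
    """Placeholder reader for the old layout."""
    return {"format": "old", "payload": payload}


def read_new(payload):
    """Placeholder reader for the new layout."""
    return {"format": "new", "payload": payload}


READERS = {OLD_FORMAT: read_old, NEW_FORMAT: read_new}


def write_file(payload, config=None):
    """Prefix the payload with a one-byte version tag chosen by the config."""
    config = config or WriterConfig()
    return bytes([config.format_version]) + payload


def read_file(data):
    """Pick the reader from the version tag so old files stay readable."""
    version = data[0]
    try:
        reader = READERS[version]
    except KeyError:
        raise ValueError(f"unsupported format version: {version}") from None
    return reader(data[1:])


default_result = read_file(write_file(b"rows"))
legacy_result = read_file(write_file(b"rows", WriterConfig(format_version=OLD_FORMAT)))
```

Dispatching on a stored version tag means existing files need no migration: files written before the change keep going through the old reader, and only newly written files use the new layout.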

-Regards
Kumar Vishal

On Thu, Nov 3, 2016 at 8:24 PM, Jacky Li <[hidden email]> wrote:

> [...]

RE: [Discussion] Please vote and comment for carbon data file format change

Jihong Ma
In reply to this post by Jacky Li
Hi Kumar,

Please attach the proposed format changes to this thread or to the associated JIRA; I would like to take a look.

Thanks!

Jihong

-----Original Message-----
From: Jacky Li [mailto:[hidden email]]
Sent: Thursday, November 03, 2016 7:54 AM
To: [hidden email]
Subject: Re: [Discussion] Please vote and comment for carbon data file format change

[...]


Re: [Discussion] Please vote and comment for carbon data file format change

kumarvishal09

Hi Jihong Ma,

Please find the attachment.

-Regards
Kumar Vishal

On Fri, Nov 4, 2016 at 12:16 AM, Jihong Ma <[hidden email]> wrote:
> [...]



Re: [Discussion] Please vote and comment for carbon data file format change

kumarvishal09
Hi All,
Please find below the JIRA issue that I have raised for the above discussion.

https://issues.apache.org/jira/browse/CARBONDATA-458

-Regards
Kumar Vishal

On Tue, Nov 29, 2016 at 7:14 PM, Kumar Vishal <[hidden email]>
wrote:

> [...]

Re: [Discussion] Please vote and comment for carbon data file format change

jarray888
In reply to this post by kumarvishal09
+1. The current data format has a slow first-time query issue, which should be fixed.

Re: [Discussion] Please vote and comment for carbon data file format change

bill.zhou
In reply to this post by kumarvishal09
+1. This modification will help in all scenarios.
Kumar Vishal wrote
> [...]

Re: [Discussion] Please vote and comment for carbon data file format change

Jean-Baptiste Onofré
+1

Regards
JB

On Dec 10, 2016, at 09:33, "bill.zhou" <[hidden email]> wrote:

> [...]