questions about carbondata

weijie
1. What is the relationship between these terms:
carbondata file, block, blocklet, and carbondata file footer? When a
batch job loads data into a carbondata table, does that mean the table
is composed of different blocks, and each block is a carbondata
file which is composed of many blocklets and one FileFooter, according to
the carbondata file format?

2. How is the column data stored as an inverted index?
What is the dimension column data inverted to? How does the inverted index
affect a query?

3. Are all the blocklets stored in sequence according to the sorted MDK key?

Hope someone can give a detailed answer.

Re: questions about carbondata

杰
hi,
1. Correct.
   One carbon file is the same as one block. One block has many blocklets as well as one file footer, which holds the metadata (B-tree index) of the blocklets.
   One load makes one segment, and one segment has many blocks.
2. Carbon sorts the dimension column data within one blocklet, so the original row sequence is lost. To compensate, carbon stores the row id together with the dimension column data:
   the dimension column data is sorted and the row id sequence is rearranged correspondingly, so the mapping (like Array: index => data) is kept.
   At query time, carbon first finds the expected dimension column data (based on the filter), then uses the mapping to get the row ids.
   Then, based on the row ids, we can get the measure data.
   So the column data is called an inverted index, which means data => index, not index => data.
3. Yes.
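
To make point 2 concrete, here is a minimal sketch in Python. It is not CarbonData's actual code: the function names `build_inverted_column` and `rows_matching` and the sample `city`/`sales` columns are made up for illustration, and a plain binary search stands in for whatever lookup carbon really uses on the sorted data.

```python
from bisect import bisect_left, bisect_right


def build_inverted_column(dim_values):
    """Sort one blocklet's dimension values and keep each value's original row id."""
    # (value, row_id) pairs sorted by value; equal values stay in row-id order.
    pairs = sorted((value, row_id) for row_id, value in enumerate(dim_values))
    sorted_values = [value for value, _ in pairs]
    row_ids = [row_id for _, row_id in pairs]
    return sorted_values, row_ids


def rows_matching(sorted_values, row_ids, wanted):
    """Find `wanted` in the sorted values, then map the hits back to row ids."""
    lo = bisect_left(sorted_values, wanted)
    hi = bisect_right(sorted_values, wanted)
    return sorted(row_ids[lo:hi])


# Hypothetical blocklet: one dimension column and one measure column, in row order.
city = ["paris", "berlin", "paris", "austin", "berlin"]
sales = [10, 20, 30, 40, 50]

sorted_city, city_row_ids = build_inverted_column(city)

# Filter on the dimension first, then fetch the measure by row id.
hits = rows_matching(sorted_city, city_row_ids, "paris")
paris_sales = [sales[row_id] for row_id in hits]
```

The sorted values answer the filter quickly, and the parallel row id list is the "data => index" direction that lets the measure values be fetched afterwards.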




------------------ Original Message ------------------
From: "weijie tong" <[hidden email]>
Sent: Friday, October 21, 2016, 4:01 PM
To: "dev" <[hidden email]>
Subject: questions about carbondata

Re: questions about carbondata

weijie
Thanks for the reply. For 3, I still want to know whether all the blocklets
of all the blocks are stored in sequence according to the sorted MDK key. If so,
the globally ordered MDK key of the carbon table would behave like an
HBase rowkey does. Or is the sequence local to each block, with the block index file
managing the block-level index?

On Fri, Oct 21, 2016 at 5:48 PM, 杰 <[hidden email]> wrote:

> 3. yes.

Re: questions about carbondata

杰
hi,  


   For 3, blocklets are not stored in a global sequence, nor in a block-local one.
Actually, we can say that blocklets are sorted within a partition, and one partition has
many blocks. The word 'partition' here means exactly Spark's partition: because
carbon does further processing in the Spark executor, one Spark partition will
have many carbon blocks. Although carbon's MDK key is not sorted globally, carbon's dictionary is global,
so global dictionary + sorting within a partition should make carbon not much different from HBase.
   As for the index files, the carbondataindex file contains the block index info, while the footer in the carbondata file contains the blocklet index info.
That gives 2 levels of index: one for the driver-side filter and one for the executor-side filter.
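
The two filter levels can be sketched roughly as below, in Python. This is only an illustration under stated assumptions: each index entry is reduced to a min and max key (a simplification of the B-tree index mentioned earlier in the thread), keys are plain integers, and the names `driver_prune`, `executor_prune`, `lookup` and the block names are hypothetical rather than CarbonData APIs.

```python
def make_blocklet(keys):
    """A blocklet holds sorted keys plus the min/max that its footer entry would record."""
    keys = sorted(keys)
    return {"min": keys[0], "max": keys[-1], "keys": keys}


def make_block(name, blocklets):
    """A block (one carbondata file): block-level min/max plus a footer listing its blocklets."""
    return {
        "name": name,
        "min": min(b["min"] for b in blocklets),
        "max": max(b["max"] for b in blocklets),
        "footer": blocklets,
    }


def overlaps(entry, key):
    return entry["min"] <= key <= entry["max"]


def driver_prune(block_index, key):
    """Level 1 (driver): use the block index to skip whole blocks."""
    return [block for block in block_index if overlaps(block, key)]


def executor_prune(block, key):
    """Level 2 (executor): use the file footer to skip blocklets inside one block."""
    return [blocklet for blocklet in block["footer"] if overlaps(blocklet, key)]


def lookup(block_index, key):
    """Return the names of blocks that really contain `key` after both pruning levels."""
    found = []
    for block in driver_prune(block_index, key):
        for blocklet in executor_prune(block, key):
            if key in blocklet["keys"]:
                found.append(block["name"])
    return found


# Keys are sorted within each partition, but key ranges overlap across partitions.
block_index = [
    make_block("part-0-block-0", [make_blocklet([1, 3, 5]), make_blocklet([6, 8, 9])]),
    make_block("part-0-block-1", [make_blocklet([10, 12]), make_blocklet([14, 15])]),
    make_block("part-1-block-0", [make_blocklet([2, 4]), make_blocklet([7, 12])]),
]

survivors_for_8 = [block["name"] for block in driver_prune(block_index, 8)]
blocks_with_12 = lookup(block_index, 12)
```

Because the data is only sorted per partition, a key such as 12 can survive driver pruning in more than one partition, which is the main difference from a single globally ordered rowkey.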
 
Thanks
Jay


------------------ Original Message ------------------
From: "weijie tong" <[hidden email]>
Sent: Saturday, October 22, 2016, 12:30 PM
To: "dev" <[hidden email]>
Subject: Re: questions about carbondata