Apache CarbonData Dev Mailing List archive

[New Feature] Adding bucketed table feature to Carbondata

ravipesala

[New Feature] Adding bucketed table feature to Carbondata

Hi All,

Bucketing is based on hash partitioning of the bucketed column into the
configured number of buckets. Records with the same bucketed column value
always go to the same bucket. Physically, each bucket is one or more files
in the table directory.

Advantages
1. A bucketed table makes map-side joins possible and avoids shuffling of
   data.
2. Carbondata can do driver-level pruning on the bucketed column to improve
   query performance.
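
As a rough illustration of the driver-level pruning idea, the sketch below maps the values of an equality or IN filter to bucket ids, so that only those bucket files need to be read. The CRC32 hash and the names `bucket_of` and `prune_buckets` are assumptions made for this example; they are not Carbondata's actual hash function or API.

```python
import zlib


def bucket_of(key, num_buckets):
    """Map a key to a bucket id. CRC32 is only a stand-in hash (assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def prune_buckets(filter_values, num_buckets):
    """Return the sorted bucket ids that can contain any of the filter values."""
    return sorted({bucket_of(v, num_buckets) for v in filter_values})


# A filter such as "WHERE user_id IN (1001, 2002)" on a 32-bucket table
# needs at most two bucket files to be scanned instead of all 32.
to_scan = prune_buckets([1001, 2002], num_buckets=32)
```

The pruning only works for predicates on the bucketed column whose values can be hashed up front, such as equality and IN filters; a range filter would still have to look at every bucket.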

A user can create a bucketed table as follows:

CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
CLUSTERED BY(user_id) INTO 32 BUCKETS;

In the above example, the user_id column is hash partitioned into 32
buckets (partition files) in Carbondata. So when joining with another
table on the bucketed column, it can select the same buckets on both sides
and do the join without shuffling.
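
The join described above can be sketched in a few lines. This is a minimal in-memory model, not Carbondata code: each Python list stands in for one bucket file, CRC32 stands in for the real hash function, and the helper names (`bucket_of`, `write_bucketed`, `bucketed_join`) are hypothetical. It assumes both tables are bucketed on the join column with the same number of buckets.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 32  # matches "INTO 32 BUCKETS" in the example above


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket id. CRC32 is only a stand-in hash (assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Group rows by the bucket of their first field (the bucketed column)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right):
    """Join bucket i of `left` only with bucket i of `right`."""
    joined = []
    for bucket_id, left_rows in left.items():
        # Build a lookup for the matching bucket on the other side only.
        right_by_key = defaultdict(list)
        for row in right.get(bucket_id, []):
            right_by_key[row[0]].append(row)
        for lrow in left_rows:
            for rrow in right_by_key.get(lrow[0], []):
                joined.append(lrow + rrow[1:])
    return joined


users = write_bucketed([(1, "Ann"), (2, "Bob"), (3, "Cid")])
orders = write_bucketed([(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")])
result = sorted(bucketed_join(users, orders))
```

Because equal keys always hash to the same bucket id, rows that can match are already located in buckets with the same id, so no row has to move between buckets during the join.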

Carbon currently creates the following folder structure, since Carbon
already supports partitioning in its file format:

dbName -> tableName -> Fact ->
    Part0 -> Segment_id -> carbondatafiles
    Part1 -> Segment_id -> carbondatafiles

We can also move the partitionId into the file metadata, but doing so
would complicate backward compatibility.
--
Thanks & Regards,
Ravindra

sraghunandan

Re: [New Feature] Adding bucketed table feature to Carbondata

How is this different from partitioning?
On Sun, 27 Nov 2016 at 11:21 PM, Ravindra Pesala <[hidden email]>
wrote:

ravipesala

Re: [New Feature] Adding bucketed table feature to Carbondata

Hi Raghu,

In Hive and Spark terminology, partitioning and bucketing are different.
Partitioning divides a large amount of data into a number of folders based
on the values of a table column. The number of partitions created depends
on the cardinality of the partitioning column, so it is very ineffective
when the cardinality is high.

On the other hand, bucketing divides the data into a user-configurable
number of roughly equal parts based on a hash of the column. It is useful
for high-cardinality columns.
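
To make the difference concrete, here is a small sketch that counts how many folders value-based partitioning would create for a high-cardinality column, compared with the number of buckets. The `user_id=<value>` folder naming and the CRC32 hash are assumptions chosen for illustration only.

```python
import zlib


def partition_dirs(values):
    """Value-based partitioning: one folder per distinct column value."""
    return {f"user_id={v}" for v in values}


def bucket_ids(values, num_buckets):
    """Hash bucketing: at most num_buckets groups, whatever the cardinality."""
    return {zlib.crc32(str(v).encode("utf-8")) % num_buckets for v in values}


user_ids = range(100_000)  # a high-cardinality column

n_partitions = len(partition_dirs(user_ids))  # 100000 folders
n_buckets = len(bucket_ids(user_ids, 32))     # never more than 32
```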

Regards,
Ravindra

On 27 November 2016 at 23:24, Raghunandan S <[hidden email]> wrote:

> How is this different from partitioning?

--
Thanks & Regards,
Ravi