http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/New-Feature-Adding-bucketed-table-feature-to-Carbondata-tp3253p3256.html
In Hive's and Spark's terminology, partitioning and bucketing are different.
Partitioning splits the data into a number of partitions that depends on
the cardinality of the partitioned column, so it is very ineffective if the
cardinality is high. Bucketing splits the data into a fixed (configurable)
number of buckets based on a hash of the column, so it is useful for
high-cardinality columns.
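
To make the contrast concrete, here is a minimal sketch in Python. The helper bucket_id and the use of crc32 are assumptions made for illustration only; this is not the hash function Carbondata, Hive, or Spark actually uses.

```python
import zlib

def bucket_id(value, num_buckets):
    """Map a column value to a bucket using a stable hash.

    crc32 is only a stand-in; real engines use their own hash functions.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# A high-cardinality column: 100,000 distinct user ids.
user_ids = range(100_000)

# Partitioning: one partition per distinct column value.
partitions = {uid for uid in user_ids}

# Bucketing: the number of buckets is fixed up front (32 here).
buckets = {bucket_id(uid, 32) for uid in user_ids}

print(len(partitions))  # grows with the cardinality of the column
print(len(buckets))     # never exceeds the configured bucket count
```

The partition count follows the data, while the bucket count stays bounded by the configured number no matter how many distinct values arrive.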
> How is this different from partitioning?
> On Sun, 27 Nov 2016 at 11:21 PM, Ravindra Pesala <[hidden email]> wrote:
>
> > Hi All,
> >
> > The bucketing concept is based on hash partitioning the bucketed column
> > into the configured number of buckets. Records with the same bucketed
> > column value always go to the same bucket. Physically, each bucket is
> > one or more files in the table directory.
> > Advantages
> > A bucketed table is useful for map-side joins and avoids
> > shuffling of data.
> > Carbondata can do driver-level pruning on the bucketed column to improve
> > query performance.
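
A rough sketch of what pruning on a bucketed column could look like, in Python. Everything here is hypothetical: the file names, the files_by_bucket listing, and the crc32-based bucket_id are assumptions for illustration and do not reflect Carbondata's actual driver code or file naming.

```python
import zlib

NUM_BUCKETS = 32

def bucket_id(value, num_buckets=NUM_BUCKETS):
    # Stand-in hash; the real engine would use its own hash function.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Hypothetical listing of data files per bucket (names are made up).
files_by_bucket = {
    b: [f"bucket-{b}.carbondata"] for b in range(NUM_BUCKETS)
}

def prune_for_equality(value):
    """Return only the files that can contain rows where user_id == value."""
    return files_by_bucket[bucket_id(value)]

# For a filter such as user_id = 12345, only one bucket needs to be scanned.
selected = prune_for_equality(12345)
print(len(selected), "of", NUM_BUCKETS, "files selected")
```

Because equal values always hash to the same bucket, an equality filter on the bucketed column lets the planner skip every other bucket before any file is opened.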
> >
> > A user can create a bucketed table as follows:
> >
> > CREATE TABLE test(user_id BIGINT, firstname STRING, lastname STRING)
> > CLUSTERED BY(user_id) INTO 32 BUCKETS;
> >
> > In the above example, the column user_id is hash partitioned into 32
> > bucket/partition files in carbondata. So when joining with another
> > table on the bucketed column, it can select the same buckets and do the
> > join without shuffling.
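
The join described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the sample rows, the helpers write_buckets and bucketed_join, and the crc32 hash are all hypothetical, and both tables are assumed to be bucketed on the same key with the same bucket count.

```python
import zlib
from collections import defaultdict

def bucket_id(value, num_buckets):
    # Stand-in hash; both tables must use the same function and bucket count.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def write_buckets(rows, num_buckets):
    """Group rows into buckets by hashing the join key (first field)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    return buckets

def bucketed_join(left, right, num_buckets):
    """Join bucket i of left only with bucket i of right; no data moves between buckets."""
    joined = []
    for b in range(num_buckets):
        lookup = defaultdict(list)
        for row in left.get(b, []):
            lookup[row[0]].append(row)
        for row in right.get(b, []):
            for match in lookup.get(row[0], []):
                joined.append(match + row[1:])
    return joined

users = [(1, "Ann"), (2, "Bob"), (3, "Cho")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")]

result = bucketed_join(write_buckets(users, 32), write_buckets(orders, 32), 32)
print(sorted(result))
```

Matching keys are guaranteed to sit in buckets with the same id on both sides, so each pair of buckets can be joined locally.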
> >
> > Carbon currently creates the following folder structure, since carbon
> > already supports partitioning in its file format:
> >
> > dbName -> tableName -> Fact ->
> >
> > Part0 -> Segment_id -> carbondatafiles
> >
> > Part1 -> Segment_id -> carbondatafiles
> >
> > We can also move the partitionId to the file metadata. But if we move the
> > partitionId to metadata, then there would be complications in backward
> > compatibility.
> > --
> > Thanks & Regards,
> > Ravindra
> >
>