Apache CarbonData Dev Mailing List archive

About bucket feature in carbon

Jacky Li

About bucket feature in carbon

Hi,

One year ago, CarbonData 1.0.0 introduced the bucket table feature. It was expected to improve join performance by avoiding a shuffle when both tables are bucketed on the same column with the same number of buckets.
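
As a rough illustration of why matching buckets can remove the shuffle, here is a minimal sketch in plain Python. It is not CarbonData's or Spark's implementation: the modulo hash, the bucket count of 4, and the names `bucket_id`, `write_bucketed` and `bucketed_join` are all assumptions made for the example. The point is only that when both tables are written with the same key and the same number of buckets, bucket i of one table can be joined with bucket i of the other, with no redistribution of rows at query time.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed value; both tables must use the same count


def bucket_id(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Toy hash for illustration; real engines use their own hash function.
    return key % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows into buckets by hashing the key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets=NUM_BUCKETS):
    """Join bucket i of `left` only with bucket i of `right`."""
    result = []
    for b in range(num_buckets):
        # Build a per-bucket lookup; rows from other buckets are never touched.
        lookup = defaultdict(list)
        for key, value in right.get(b, []):
            lookup[key].append(value)
        for key, lvalue in left.get(b, []):
            for rvalue in lookup.get(key, []):
                result.append((key, lvalue, rvalue))
    return result


# Hypothetical sample data, bucketed on the same key with the same bucket count.
orders = write_bucketed([(1, "order-a"), (2, "order-b"), (5, "order-c")])
users = write_bucketed([(1, "alice"), (5, "bob"), (7, "carol")])
joined = bucketed_join(orders, users)
```

If the two tables used different bucket counts or different bucketing columns, matching keys could land in different bucket numbers and the per-bucket join above would miss them, which is why the condition "same column, same number of buckets" matters.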

However, since this feature was introduced, my personal impression is that it has not been widely used in the community, and it creates maintenance overhead for developers: for every new pull request, all bucket-related test cases need to be fixed.

Now that Carbon has integrated with Spark's standard partitioning, developers can add bucket support using Spark's bucketed table feature in the future if it is required.

So, I propose to remove the bucket feature after the CarbonData 1.3.0 release.
What do you think?

Regards,
Jacky

ravipesala

Re: About bucket feature in carbon

Hi Likun,

I feel it is better to change the implementation to use Spark's bucketing
generation, just as is done for standard Hive partitions. It will be easy
to change once the partition feature is implemented. Bucketing is also a
useful feature for joining big tables: hash-based buckets and CLUSTERED BY
make such queries faster. So it is better to change the implementation
instead of removing it.

Regards,
Ravindra.

On 9 February 2018 at 13:14, Jacky Li <[hidden email]> wrote:

> Hi,
>
> One year ago, CarbonData 1.0.0 introduced the bucket table feature. It was
> expected to improve join performance by avoiding a shuffle when both tables
> are bucketed on the same column with the same number of buckets.
>
> However, since this feature was introduced, my personal impression is that
> it has not been widely used in the community, and it creates maintenance
> overhead for developers: for every new pull request, all bucket-related
> test cases need to be fixed.
>
> Now that Carbon has integrated with Spark's standard partitioning,
> developers can add bucket support using Spark's bucketed table feature in
> the future if it is required.
>
> So, I propose to remove the bucket feature after the CarbonData 1.3.0 release.
> What do you think?
>
> Regards,
> Jacky
>
>

--
Thanks & Regards,
Ravi

Jacky Li-2

Re: About bucket feature in carbon

Hi Ravindra,

You mean we can do one round of refactoring for the bucketed table feature in CarbonData 1.4?
I am fine with that.

Regards,
Jacky

> On 9 February 2018, at 3:49 PM, Ravindra Pesala <[hidden email]> wrote:
>
> Hi Likun,
>
> I feel it is better to change the implementation to use Spark's bucketing
> generation, just as is done for standard Hive partitions. It will be easy
> to change once the partition feature is implemented. Bucketing is also a
> useful feature for joining big tables: hash-based buckets and CLUSTERED BY
> make such queries faster. So it is better to change the implementation
> instead of removing it.
>
> Regards,
> Ravindra.
>
> --
> Thanks & Regards,
> Ravi

ravipesala

Re: About bucket feature in carbon

Yes Jacky, we will refactor it and use the partition flow.

On 9 February 2018 at 13:44, Jacky Li <[hidden email]> wrote:

> Hi Ravindra,
>
> You mean we can do one round of refactoring for the bucketed table feature
> in CarbonData 1.4?
> I am fine with that.
>
> Regards,
> Jacky
>

--
Thanks & Regards,
Ravi