Hi,
One year ago, CarbonData 1.0.0 introduced the bucket table feature. It was expected to improve join performance by avoiding shuffling when both tables are bucketed on the same column with the same number of buckets.

However, since this feature was introduced, in my personal view it has not been widely used in the community, and it creates maintenance overhead for developers (for every new pull request, all bucket-related test cases need to be fixed).

Now that Carbon has integrated with the Spark standard partition, developers can add bucket support using the Spark bucketed table feature in the future if it is required.

So, I propose to remove the bucket feature after CarbonData 1.3.0. What do you think?

Regards,
Jacky
Hi Likun,
I feel it is better to change the implementation to use Spark's bucketing generation, just as the standard Hive partitions are generated. It will be easy to change once the partition feature is implemented. It is a useful feature for joining big tables, and hash-based buckets with CLUSTERED BY make queries faster. So it is better to change the implementation instead of removing it.

Regards,
Ravindra.

On 9 February 2018 at 13:14, Jacky Li <[hidden email]> wrote:
> So, I propose to remove the bucket feature after CarbonData 1.3.0.
> What do you think?

--
Thanks & Regards,
Ravi
Hi Ravindra,
You mean we can do one round of refactoring for the bucketed table feature in CarbonData 1.4? I am fine with it.

Regards,
Jacky

> On 9 February 2018 at 3:49 PM, Ravindra Pesala <[hidden email]> wrote:
>
> So it is better to change the implementation instead of removing it.
Yes Jacky, we will refactor it and use the partition flow.
On 9 February 2018 at 13:44, Jacky Li <[hidden email]> wrote:
> You mean we can do one round of refactoring for the bucketed table feature in CarbonData 1.4.
> I am fine with it.

--
Thanks & Regards,
Ravi