[Feature] proposal for update and delete support in Carbon data

Vinod KC
Hi All,
I would like to propose the following new features for CarbonData:
1) An Update statement to modify existing records in a CarbonData
table
2) A Delete statement to remove records from a CarbonData table

A) Update operation: Update support can be added to CarbonData by using
intermediate delta files (delete and update delta files), with limited
impact on existing code.
An update can be treated as a ‘delete’ followed by an ‘insert’.
Once an update has been applied to a CarbonData file, a select query can
have the CarbonData store reader use the delete delta data cache to exclude
the deleted records in that segment, and then include the records from the
newly added update delta files.
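
To make the read path above concrete, here is a minimal sketch, not actual
CarbonData code. The Segment class, its fields, and the update and scan
functions are hypothetical names, and keying the update delta by the
original row id is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical in-memory stand-in for one segment (not a CarbonData class)."""
    base_rows: dict                                   # row_id -> record, never rewritten
    delete_delta: set = field(default_factory=set)    # row_ids marked as deleted
    update_delta: dict = field(default_factory=dict)  # row_id -> new record version


def scan(segment):
    """Yield (row_id, record) for live rows: base minus deleted, plus updated."""
    for row_id, record in segment.base_rows.items():
        if row_id not in segment.delete_delta:
            yield row_id, record
    yield from segment.update_delta.items()


def update(segment, predicate, change):
    """Update = 'delete' the current version, then 'insert' the new version."""
    # Materialize the scan first, because the deltas are modified in the loop.
    for row_id, record in list(scan(segment)):
        if predicate(record):
            segment.delete_delta.add(row_id)               # hide the base version
            segment.update_delta[row_id] = change(record)  # store the new version
```

The base rows are never modified here; only the two delta structures grow,
which is why the change to existing code could stay small.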

B) Delete operation: For a delete, a delete delta file is added to each
segment that contains matching records. During a select query, the
CarbonData reader excludes those deleted records from the result set.
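
As a rough illustration of the delete flow, assuming each segment is just a
list of rows and a delete delta is a set of row positions (delete_where and
select are hypothetical names, not CarbonData APIs):

```python
def delete_where(segments, predicate):
    """Build one delete delta per segment, only for segments with matching rows."""
    delete_deltas = {}
    for name, rows in segments.items():
        hits = {pos for pos, row in enumerate(rows) if predicate(row)}
        if hits:  # segments without matches get no delta at all
            delete_deltas[name] = hits
    return delete_deltas


def select(segments, delete_deltas):
    """Scan every segment, skipping the positions listed in its delete delta."""
    result = []
    for name, rows in segments.items():
        deleted = delete_deltas.get(name, set())
        result.extend(row for pos, row in enumerate(rows) if pos not in deleted)
    return result
```

For example, deleting the row with id 2 from two segments would produce a
delta for the first segment only:

```python
segments = {"segment_0": [{"id": 1}, {"id": 2}], "segment_1": [{"id": 3}]}
deltas = delete_where(segments, lambda row: row["id"] == 2)
remaining = select(segments, deltas)
```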

Please share your suggestions and thoughts on the design and functional
aspects of this feature. I’ll share a detailed design document covering
the ideas above later.

Regards
Vinod

Re: [Feature] proposal for update and delete support in Carbon data

Jacky Li
Hi Vinod,

It is great to have this feature; many people asked for data update support at the earlier CarbonData meetup. I believe it will be useful for many big data applications.

I have the following questions about the solution you proposed:
1. Data update is complex, especially if transactions are involved. What level of ACID support do you have in mind?
2. If I understand correctly, you are proposing to do data updates via a base + delta file approach, right? In that case, a new file format needs to be added to the CarbonData project.
3. CarbonData has built-in index support. Any idea what the impact will be on the B-tree index already held in driver and executor memory?

Regards,
Jacky

> On Nov 15, 2016, at 12:25 PM, Vinod KC <[hidden email]> wrote:




Re: [Feature] proposal for update and delete support in Carbon data

hexiaoqiao
Hi Vinod,

As Jacky mentioned, many people have been waiting for this feature. I think
Update/Delete should be a basic module of CarbonData, although it is a
complex problem for a distributed storage system. The solution you proposed
is based on the traditional 'Base + Delta' approach, which has been applied
successfully in Bigtable, HBase, Kudu, and others. Regarding your proposed
solution for CarbonData, I have some questions in addition to the ones Jacky
raised about transactions and indexes:

1. How do we trade off the IO overhead that delta files add? I think there
may be two query approaches for delta files: (1) load the whole delta data
and replace rows in the base query result that also exist in the delta file.
This may increase IO overhead, which CarbonData tries to reduce as much as
possible. (2) build a separate index for all delta files, or label delta
records and upgrade the file format. Is that right?
2. When and how should minor/major compaction be done on (base + delta) or
(delta + delta)?
3. Are there any issues with updating or deleting a Directory item?
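
To make question 2 concrete, here is one possible reading of the two
compaction kinds, sketched for delete deltas only. It assumes a delta is a
set of row positions and the base is a list of rows; minor_compact and
major_compact are hypothetical names, and when to trigger either step is
left open.

```python
def minor_compact(delete_deltas):
    """(delta + delta): merge several small delete deltas into a single one."""
    merged = set()
    for delta in delete_deltas:
        merged |= delta
    return merged


def major_compact(base_rows, delete_delta):
    """(base + delta): rewrite the base without deleted rows and reset the delta."""
    new_base = [row for pos, row in enumerate(base_rows) if pos not in delete_delta]
    return new_base, set()
```

After a major compaction the row positions change, so any index built on the
old base would have to be rebuilt, which ties back to Jacky's index question.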

I look forward to the detailed design of your solution.

Please correct me if I am wrong.

Best Regards,
He Xiaoqiao


On Tue, Nov 15, 2016 at 5:39 PM, Jacky Li <[hidden email]> wrote:
