
Re: [Feature ]Design Document for Update/Delete support in CarbonData

Posted by Aniket Adnaik on Nov 25, 2016; 3:02am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Feature-Design-Document-for-Update-Delete-support-in-CarbonData-tp3043p3183.html

Hi Kumar Vishal,

Yes, valid point. There have been thoughts about this, and there is a lot of
scope for optimizing compaction strategies. In the future we may even consider
a background monitor process (or a cron job or similar) to monitor and trigger
compaction automatically.

Best Regards,
Aniket
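
A minimal sketch of how such an automatic trigger might choose between the two
compaction types discussed in this thread. Everything here is hypothetical and
not CarbonData API: the names SegmentStats and choose_compaction, the
max_delta_files knob, and the reading that horizontal compaction merges delta
files together while vertical compaction rewrites the base data with the
deltas applied. The 60% default only echoes the example figure from the quoted
message below.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentStats:
    """Hypothetical per-segment counters a monitor process could collect."""
    total_rows: int         # rows in the base carbon files
    invalidated_rows: int   # base rows listed in delete delta files
    delta_file_count: int   # number of delta files attached to the segment


def choose_compaction(stats: SegmentStats,
                      vertical_ratio: float = 0.60,
                      max_delta_files: int = 4) -> Optional[str]:
    """Return "vertical", "horizontal", or None (nothing to do).

    vertical_ratio and max_delta_files are illustrative defaults,
    not real CarbonData configuration properties.
    """
    if stats.total_rows <= 0:
        return None
    changed_fraction = stats.invalidated_rows / stats.total_rows
    if changed_fraction >= vertical_ratio:
        # Most of the base data is stale: rewrite it right away.
        return "vertical"
    if stats.delta_file_count >= max_delta_files:
        # Few rows changed, but many small delta files have piled up.
        return "horizontal"
    return None
```

A cron job or background thread could call choose_compaction for each segment
after every update/delete and schedule the returned compaction type.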

On Thu, Nov 24, 2016 at 1:32 AM, Kumar Vishal <[hidden email]>
wrote:

> Hi Aniket,
>
> I think if update/delete touches only a small amount of data, then
> horizontal compaction can be based on user configuration. But if more data
> is getting updated, it is better to start vertical compaction immediately.
> This is because we are not physically deleting the data from disk: if more
> data is getting updated (more than 60%), then during a query we will first
> read the older data, exclude the deleted records, and include the update
> delta file data. So in this case more data will come into memory; we can
> avoid this by starting vertical compaction immediately after update/delete.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 2:43 PM, Kumar Vishal <[hidden email]>
> wrote:
>
> > Hi Aniket,
> >
> > I agree with Vimal's opinion, but that use case will be very rare.
> >
> > I have one query about this update and delete feature.
> > When will we start compaction? After each update or delete operation?
> >
> > -Regards
> > Kumar Vishal
> >
> >
> >
> > On Thu, Nov 24, 2016 at 12:05 AM, Aniket Adnaik <[hidden email]>
> > wrote:
> >
> >> Hi Vimal,
> >>
> >> Thanks for your suggestions.
> >> For the 1st point, I tend to agree with Manish's comments. But it's worth
> >> looking into different ways to optimize the performance.
> >> I guess query performance may take priority over update performance.
> >> Basically, we may need a better compaction approach to merge
> >> delta files into regular carbon files to maintain adequate performance.
> >> For the 2nd point, CarbonData will support updating multiple rows, but not
> >> the same row multiple times in a single update operation. It is possible
> >> that the join condition in the sub-select of the original update statement
> >> results in multiple rows from the source table for the same row in the
> >> target table. This is an ambiguous condition, and common ways to resolve it
> >> are to error out, to apply the first matching row, or to apply the last
> >> matching row. CarbonData will choose to error out and let the user resolve
> >> the ambiguity, which is the safer/standard choice.
> >>
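
The "error out on ambiguity" rule described above can be sketched as follows.
This is only an illustration under stated assumptions, not CarbonData code:
AmbiguousUpdateError and apply_update are hypothetical names, the target table
is modeled as a dict from row id to value, and the join result is modeled as a
list of (target row id, new value) pairs.

```python
from collections import Counter


class AmbiguousUpdateError(Exception):
    """Raised when one target row matches more than one source row."""


def apply_update(target, source_matches):
    """Return a copy of `target` with the matched rows updated.

    target:         dict mapping row id -> current value
    source_matches: list of (row id, new value) pairs produced by the join
    """
    counts = Counter(row_id for row_id, _ in source_matches)
    ambiguous = sorted(row_id for row_id, n in counts.items() if n > 1)
    if ambiguous:
        # Do not pick the first or last match silently; let the user fix the query.
        raise AmbiguousUpdateError(
            f"multiple source rows match target rows {ambiguous}")
    updated = dict(target)
    for row_id, new_value in source_matches:
        if row_id in updated:
            updated[row_id] = new_value
    return updated
```

Updating many different rows in one call succeeds; two source rows for the
same target row raise the error before anything is modified.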
> >> Best Regards,
> >> Aniket
> >>
> >> On Wed, Nov 23, 2016 at 4:54 AM, manish gupta <[hidden email]>
> >> wrote:
> >>
> >> > Hi Vimal,
> >> >
> >> > I have a few queries regarding the 1st suggestion.
> >> >
> >> > 1. Dimensions can be either dictionary or no-dictionary. If we update the
> >> > dictionary file, then we will have to maintain 2 flows: one for dictionary
> >> > columns and one for no-dictionary columns. Will that be OK?
> >> >
> >> > 2. We write dictionary files in append mode. Updating a dictionary file
> >> > would mean completely rewriting it, which will also modify the
> >> > dictionary metadata and sort index file. Or is there some other approach
> >> > that needs to be followed, like maintaining an update delta mapping for
> >> > the dictionary file?
> >> >
> >> > Regards
> >> > Manish Gupta
> >> >
> >> > On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath <[hidden email]>
> >> > wrote:
> >> >
> >> > > Hi Aniket,
> >> > >
> >> > > The design looks sound and the documentation is great.
> >> > > I have a few suggestions.
> >> > >
> >> > > 1) Measure update vs dimension update: In the case of a dimension update,
> >> > > for example, the user wants to change dept1 to dept2 for all users who
> >> > > are under dept1. Can we just update the dictionary for faster performance?
> >> > > 2) Update Semantics (one matching record vs multiple matching
> >> > > records): I could not understand this section. I wanted to confirm
> >> > > whether we will support one update statement updating multiple rows.
> >> > >
> >> > > -Vimal
> >> > >
> >> > > On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen <[hidden email]>
> >> > > wrote:
> >> > >
> >> > > > Hi Aniket,
> >> > > >
> >> > > > Thank you for finishing a good design document. A couple of inputs
> >> > > > from my side:
> >> > > >
> >> > > > 1. Please also add the information mentioned below (RowID definition,
> >> > > > etc.) to the design document.
> >> > > > 2. On page 6: "Schema change operation can run in parallel with Update
> >> > > > or Delete operations, but not with another schema change operation".
> >> > > > Can you explain this item?
> >> > > > 3. Please unify the terminology: use "CarbonData" instead of "Carbon",
> >> > > > and use one term consistently for "destination table" and "target table".
> >> > > > 4. Is the Update operation's delete delta the same as the Delete
> >> > > > operation's delete delta?
> >> > > >
> >> > > > BTW, it would be much better if you could provide Google Docs for
> >> > > > review next time; it is really difficult to comment on PDF
> >> > > > documents :)
> >> > > >
> >> > > > Regards
> >> > > > Liang
> >> > > >
> >> > > > Aniket Adnaik wrote
> >> > > > > Hi Sujith,
> >> > > > >
> >> > > > > Please see my comments inline.
> >> > > > >
> >> > > > > Best Regards,
> >> > > > > Aniket
> >> > > > >
> >> > > > > On Sun, Nov 20, 2016 at 9:11 PM, sujith chacko <sujithchacko.2010@>
> >> > > > > wrote:
> >> > > > >
> >> > > > >> Hi Aniket,
> >> > > > >>
> >> > > > >>       It's a well-documented design. I just want to know a few
> >> > > > >> points, like:
> >> > > > >>
> >> > > > >> a.  Format of the RowID and its datatype
> >> > > > >>
> >> > > > >  AA>> The following format can be used to represent a unique RowID:
> >> > > > >
> >> > > > >  [<Segment ID><Block ID><Blocklet ID><Offset in Blocklet>]
> >> > > > >
> >> > > > >  A simple way would be to use the String data type and store it as a
> >> > > > > text file. However, a more efficient way could be to use
> >> > > > > Bitsets/Bitmaps as a further optimization. Compressed bitmaps such as
> >> > > > > Roaring bitmaps can be used for better performance and efficient
> >> > > > > storage.
> >> > > > >
> >> > > > > b.  Impact of this feature on select queries: since every query
> >> > > > > has to exclude each deleted record and include the corresponding
> >> > > > > updated record, is any optimization being considered to tackle the
> >> > > > > query performance issue, given that one of the major highlights of
> >> > > > > Carbon is performance?
> >> > > > > AA>> Some of the optimizations would be to cache the deltas to avoid
> >> > > > > recurrent I/O, to store sorted RowIDs in the delete delta for
> >> > > > > efficient lookup, and to perform regular compaction to minimize the
> >> > > > > impact on select query performance. Additionally, we may have to
> >> > > > > explore ways to perform compaction automatically, for example, if
> >> > > > > more than 25% of rows are read from deltas.
> >> > > > > Please feel free to share if you have any ideas or suggestions.
> >> > > > >
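
The RowID layout and the sorted delete delta described above can be sketched
like this. It is a rough illustration only: the RowId class, the integer
component types, the "/" text separator, and the helper names are all
assumptions made for the example, not the format CarbonData actually uses.

```python
from bisect import bisect_left
from typing import NamedTuple


class RowId(NamedTuple):
    """Hypothetical RowID: segment, block, blocklet, offset in blocklet."""
    segment_id: int
    block_id: int
    blocklet_id: int
    offset: int

    def to_text(self) -> str:
        # Simple String form, e.g. for storing in a text delta file.
        return "/".join(str(part) for part in self)

    @classmethod
    def from_text(cls, text: str) -> "RowId":
        return cls(*(int(part) for part in text.split("/")))


def is_deleted(sorted_delete_delta, row_id):
    """Binary-search a sorted list of RowIds instead of scanning it."""
    i = bisect_left(sorted_delete_delta, row_id)
    return i < len(sorted_delete_delta) and sorted_delete_delta[i] == row_id


def visible_rows(rows, sorted_delete_delta):
    """Yield (row_id, value) pairs whose row_id is not in the delete delta."""
    for row_id, value in rows:
        if not is_deleted(sorted_delete_delta, row_id):
            yield row_id, value
```

Because tuples compare component by component, sorting RowId values groups
them by segment, then block, then blocklet, so each lookup is O(log n). A
bitmap per blocklet, as suggested above, would replace the sorted list with a
set of offsets.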
> >> > > > > Thanks,
> >> > > > > Sujith
> >> > > > >
> >> > > > > On Nov 20, 2016 9:24 PM, "Aniket Adnaik" <aniket.adnaik@> wrote:
> >> > > > >
> >> > > > >> Hi All,
> >> > > > >>
> >> > > > >> Please find a design doc for Update/Delete support in
> CarbonData.
> >> > > > >>
> >> > > > >> https://drive.google.com/file/d/0B71_EuXTdDi8S2dxVjN6Z1RhWlU/view?usp=sharing
> >> > > > >>
> >> > > > >> Best Regards,
> >> > > > >> Aniket
> >> > > > >>