Re: [Feature ]Design Document for Update/Delete support in CarbonData

Posted by Aniket Adnaik on Nov 23, 2016; 6:35pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Feature-Design-Document-for-Update-Delete-support-in-CarbonData-tp3043p3133.html

Hi Vimal,

Thanks for your suggestions.
For the 1st point, I tend to agree with Manish's comments. But it's worth
looking into different ways to optimize the performance.
I guess query performance may take priority over update performance.
Basically, we may need a better compaction approach to merge
delta files into regular carbon files to maintain adequate performance.
For the 2nd point, CarbonData will support updating multiple rows, but not
the same row multiple times in a single update operation. It is possible
that the join condition in the sub-select of the original update statement
can result in multiple rows from the source table for the same row in the
target table. This is an ambiguous condition, and common ways to resolve it
are to error out, to apply the first matching row, or to apply the last
matching row. CarbonData will choose to error out and let the user resolve
the ambiguity, which is a safer/standard choice.
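
To make the "error out" choice concrete, here is a minimal Python sketch of the
check, not the actual CarbonData implementation. The names apply_update and
AmbiguousUpdateError are hypothetical, and tables are modelled as plain lists
of dicts purely for illustration.

```python
from collections import defaultdict


class AmbiguousUpdateError(Exception):
    """Hypothetical error: one target row matches more than one source row."""


def apply_update(target, source, key, column):
    """Set target[column] from the source row that joins on key.

    target and source are lists of dicts (an assumption for illustration).
    Returns the number of target rows updated. Raises AmbiguousUpdateError,
    leaving target untouched, if any target row matches several source rows.
    """
    # Group source rows by join key.
    matches = defaultdict(list)
    for row in source:
        matches[row[key]].append(row)

    # Validate first, so a rejected statement changes nothing.
    for row in target:
        count = len(matches.get(row[key], []))
        if count > 1:
            raise AmbiguousUpdateError(
                f"target row with {key}={row[key]!r} matches {count} source rows"
            )

    # Every target row now has zero or one matching source row.
    updated = 0
    for row in target:
        found = matches.get(row[key])
        if found:
            row[column] = found[0][column]
            updated += 1
    return updated
```

Updating many different rows in one statement is fine here; only a target row
with two or more matching source rows makes the whole statement fail.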

Best Regards,
Aniket

On Wed, Nov 23, 2016 at 4:54 AM, manish gupta <[hidden email]>
wrote:

> Hi Vimal,
>
> I have a few queries regarding the 1st suggestion.
>
> 1. Dimensions can be either dictionary or no dictionary. If we update the
> dictionary file, then we will have to maintain 2 flows: one for dictionary
> columns and one for no dictionary columns. Will that be OK?
>
> 2. We write dictionary files in append mode. Updating a dictionary file will
> mean completely rewriting it, which will also modify the dictionary
> metadata and sort index file. Or is there some other approach that needs
> to be followed, like maintaining an update delta mapping for the
> dictionary file?
>
> Regards
> Manish Gupta
>
> On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath <
> [hidden email]> wrote:
>
> > Hi Aniket,
> >
> > The design looks sound and the documentation is great.
> > I have a few suggestions.
> >
> > 1) Measure update vs dimension update: In case of a dimension update, for
> > example, the user wants to change dept1 to dept2 for all users who are
> > under dept1. Can we just update the dictionary for faster performance?
> > 2) Update Semantics (one matching record vs multiple matching records): I
> > could not understand this section. Wanted to confirm if we will support
> > one update statement updating multiple rows.
> >
> > -Vimal
> >
> > On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen <[hidden email]>
> > wrote:
> >
> > > Hi Aniket,
> > >
> > > Thanks for finishing this good design document. A couple of inputs from
> > > my side:
> > >
> > > 1. Please also add the info mentioned below (RowID definition etc.) to
> > > the design document.
> > > 2. On page 6: "Schema change operation can run in parallel with Update or
> > > Delete operations, but not with another schema change operation". Can
> > > you explain this item?
> > > 3. Please unify the terminology: use "CarbonData" instead of "Carbon",
> > > and use either "destination table" or "target table" consistently.
> > > 4. Is the Update operation's delete delta the same as the Delete
> > > operation's delete delta?
> > >
> > > BTW, it would be much better if you could provide Google Docs for review
> > > next time; it is really difficult to comment on PDF documents :)
> > >
> > > Regards
> > > Liang
> > >
> > > Aniket Adnaik wrote
> > > > Hi Sujith,
> > > >
> > > > Please see my comments inline.
> > > >
> > > > Best Regards,
> > > > Aniket
> > > >
> > > > > On Sun, Nov 20, 2016 at 9:11 PM, sujith chacko <sujithchacko.2010@>
> > > > > wrote:
> > > >
> > > >> Hi Aniket,
> > > >>
> > > >>       Its a well documented design,  just want to know few points
> like
> > > >>
> > > >> a.  Format of the RowID and its datatype
> > > >>
> > > >  AA>> Following format can be used to represent a unique rowed;
> > > >
> > > >  [
> > > > <Segment ID>
> > > > <Block ID>
> > > > <Blocklet ID>
> > > > <Offset in Blocklet>
> > > > ]
> > > >  A simple way would be to use String data type and store it as a text
> > > > file.
> > > > However, more efficient way could be to use Bitsets/Bitmaps as
> further
> > > > optimization. Compressed Bitmaps such as Roaring bitmaps can be used
> > for
> > > > better performance and efficient storage.
> > > >
> > > > > b.  Impact of this feature on select queries: since the query process
> > > > > has to exclude each deleted record and include the corresponding
> > > > > updated record every time, is any optimization being considered to
> > > > > tackle the query performance issue, given that one of the major
> > > > > highlights of carbon is performance?
> > > > > AA>> Some of the optimizations would be to cache the deltas to avoid
> > > > > recurrent I/O, to store sorted rowids in the delete delta for efficient
> > > > > lookup, and to perform regular compaction to minimize the impact on
> > > > > select query performance. Additionally, we may have to explore ways to
> > > > > perform compaction automatically, for example, if more than 25% of rows
> > > > > are read from deltas.
> > > > > Please feel free to share if you have any ideas or suggestions.
> > > >
> > > > Thanks,
> > > > Sujith
> > > >
> > > > > On Nov 20, 2016 9:24 PM, "Aniket Adnaik" <aniket.adnaik@> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> Please find a design doc for Update/Delete support in CarbonData.
> > > >>
> > > > >> https://drive.google.com/file/d/0B71_EuXTdDi8S2dxVjN6Z1RhWlU/view?usp=sharing
> > > >>
> > > >> Best Regards,
> > > >> Aniket
> > > >>
> > >
> > > --
> > > View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Feature-Design-Document-for-Update-Delete-support-in-CarbonData-tp3043p3093.html
> > > Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
> > >
> >
>
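
The row ID layout and the delete-delta optimizations discussed in the quoted
replies above (sorted rowids for efficient lookup, compaction once more than
25% of rows are read from deltas) can be sketched roughly as follows. This is
an illustration under assumptions, not CarbonData code: the integer fields, the
"/" separator in the text form, and the names RowId, is_deleted and
needs_compaction are all hypothetical.

```python
from bisect import bisect_left
from typing import NamedTuple


class RowId(NamedTuple):
    """Hypothetical row ID: [Segment ID, Block ID, Blocklet ID, Offset in Blocklet]."""
    segment_id: int
    block_id: int
    blocklet_id: int
    offset: int

    def to_text(self):
        # Simple String representation; the "/" separator is an assumption.
        return "/".join(str(part) for part in self)

    @classmethod
    def from_text(cls, text):
        return cls(*(int(part) for part in text.split("/")))


def is_deleted(sorted_delta, row_id):
    """Binary search over a delete delta kept as a sorted list of RowId."""
    i = bisect_left(sorted_delta, row_id)
    return i < len(sorted_delta) and sorted_delta[i] == row_id


def needs_compaction(rows_from_deltas, total_rows, threshold=0.25):
    """True when more than `threshold` of the rows read came from deltas."""
    return total_rows > 0 and rows_from_deltas / total_rows > threshold
```

Because tuples compare field by field, sorting RowId values orders them by
segment, then block, then blocklet, then offset, so a scan could check each row
it reads with is_deleted in logarithmic time. A compressed bitmap per blocklet,
as suggested in the thread, would replace the sorted list in a more
space-efficient design.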