[Feature ]Design Document for Update/Delete support in CarbonData

[Feature ]Design Document for Update/Delete support in CarbonData

Aniket Adnaik
Hi All,

Please find a design doc for Update/Delete support in CarbonData.

https://drive.google.com/file/d/0B71_EuXTdDi8S2dxVjN6Z1RhWlU/view?usp=sharing

Best Regards,
Aniket

Re: [Feature ]Design Document for Update/Delete support in CarbonData

hexiaoqiao
hi Aniket Adnaik,

It is a great design document about update/delete, and a very useful feature
for CarbonData.

For the solution you proposed, I think the most difficult challenge is
compaction. Without careful attention, rewriting data over and over can
lead to serious network and disk over-subscription. In other words,
compaction trades some disk I/O now for fewer seeks later; HBase and
LevelDB face the same issue.

The following compaction solutions from LevelDB/HBase could serve as
references for the detailed design. FYI.

   - FIFO Compaction (HBASE-14468
   <https://issues.apache.org/jira/browse/HBASE-14468>)
   - Tier-Based Compaction (HBASE-7055
   <https://issues.apache.org/jira/browse/HBASE-7055>, HBASE-14477
   <https://issues.apache.org/jira/browse/HBASE-14477>)
   - Level Compaction (LevelDB implementation notes
   <https://rawgit.com/google/leveldb/master/doc/impl.html>) / Stripe
   Compaction (HBASE-7667 <https://issues.apache.org/jira/browse/HBASE-7667>)

Please correct me if I am wrong.

Regards,
He Xiaoqiao



Re: [Feature ]Design Document for Update/Delete support in CarbonData

sujith chacko
In reply to this post by Aniket Adnaik
Hi Aniket,

      It's a well-documented design; I just want to know a few points:

a.  The format of the RowID and its data type.

b.  The impact of this feature on select queries: since the query process
has to exclude every deleted record and include the corresponding updated
record each time, has any optimization been considered to tackle this query
performance issue, given that one of the major highlights of CarbonData is
performance?

Thanks,
Sujith


Re: [Feature ]Design Document for Update/Delete support in CarbonData

Aniket Adnaik
Hi Sujith,

Please see my comments inline.

Best Regards,
Aniket

On Sun, Nov 20, 2016 at 9:11 PM, sujith chacko <[hidden email]>
wrote:

> Hi Aniket,
>
>       Its a well documented design,  just want to know few points like
>
> a.  Format of the RowID and its datatype
>
 AA>> The following format can be used to represent a unique rowid:

 [<Segment ID><Block ID><Blocklet ID><Offset in Blocklet>]
 A simple way would be to use a String data type and store it in a text
file. A more efficient way could be to use bitsets/bitmaps as a further
optimization; compressed bitmaps such as Roaring bitmaps can be used for
better performance and more efficient storage.
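As an illustrative sketch only (not the actual CarbonData implementation), the string variant of this rowid could be encoded and decoded like this; the delimiter and function names are assumptions:

```python
# Toy encoding of the proposed rowid layout
# [<Segment ID><Block ID><Blocklet ID><Offset in Blocklet>]
# as a '/'-delimited string. Purely illustrative; not CarbonData's format.

def encode_rowid(segment_id, block_id, blocklet_id, offset):
    """Pack the four components into one string rowid."""
    return f"{segment_id}/{block_id}/{blocklet_id}/{offset}"

def decode_rowid(rowid):
    """Split a string rowid back into its integer components."""
    segment_id, block_id, blocklet_id, offset = rowid.split("/")
    return int(segment_id), int(block_id), int(blocklet_id), int(offset)

rid = encode_rowid(2, 5, 1, 1042)
print(rid)                 # 2/5/1/1042
print(decode_rowid(rid))   # (2, 5, 1, 1042)
```

A bitmap representation would instead group the offsets per blocklet, which is where compressed structures such as Roaring bitmaps pay off.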

b.  The impact of this feature on select queries: since the query process
has to exclude every deleted record and include the corresponding updated
record each time, has any optimization been considered to tackle this query
performance issue, given that one of the major highlights of CarbonData is
performance?
AA>> Some of the optimizations would be to cache the deltas to avoid
recurrent I/O, to store sorted rowids in the delete delta for efficient
lookup, and to perform regular compaction to minimize the impact on select
query performance. Additionally, we may have to explore ways to trigger
compaction automatically, for example when more than 25% of the rows read
come from deltas. Please feel free to share any ideas or suggestions.
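Two of these optimizations, sorted rowids for lookup and a threshold-based compaction trigger, can be sketched roughly as follows. This is illustrative only; the class, the function names, and treating the 25% figure as a tunable threshold are assumptions drawn from the discussion, not the actual implementation:

```python
import bisect

class DeleteDelta:
    """Toy delete delta: keeps deleted row offsets sorted so membership
    tests are O(log n) binary searches (sketch, not CarbonData code)."""

    def __init__(self):
        self.deleted = []  # sorted list of deleted row offsets

    def delete(self, offset):
        bisect.insort(self.deleted, offset)  # insert, keeping order

    def is_deleted(self, offset):
        i = bisect.bisect_left(self.deleted, offset)
        return i < len(self.deleted) and self.deleted[i] == offset

def should_compact(rows_from_deltas, total_rows_read, threshold=0.25):
    """Trigger compaction when more than `threshold` of the rows read
    by a query came from delta files (25% per the discussion)."""
    return total_rows_read > 0 and rows_from_deltas / total_rows_read > threshold

dd = DeleteDelta()
dd.delete(10)
dd.delete(3)
print(dd.is_deleted(3))          # True
print(dd.is_deleted(7))          # False
print(should_compact(30, 100))   # True (30% > 25%)
```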


Re: [Feature ]Design Document for Update/Delete support in CarbonData

Aniket Adnaik
In reply to this post by hexiaoqiao
Hi He Xiaoqiao,

Yes, you are right: compaction (along with delta merging) is an important
part of maintaining adequate read performance, and it needs to be done in
an efficient manner.
Thanks for sharing the useful links.

Best Regards,
Aniket


Re: [Feature ]Design Document for Update/Delete support in CarbonData

manishgupta88
In reply to this post by Aniket Adnaik
Hi Aniket,

I think the RowID format should also include a partition ID. Currently
CarbonData does not support partitioning, but going forward, when we do,
this format would accommodate it:

 [<Partition ID><Segment ID><Block ID><Blocklet ID><Offset in Blocklet>]
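As a hedged sketch of this extended layout (field names assumed, not the actual format), encoding the rowid as a tuple makes the partition ID the most significant component, so rowids naturally sort and group by partition first:

```python
# Illustrative tuple encoding of the partition-extended rowid:
# [<Partition ID><Segment ID><Block ID><Blocklet ID><Offset in Blocklet>]
# Lexicographic tuple comparison orders rows by partition, then segment,
# block, blocklet, and offset. Not CarbonData's actual representation.

def make_rowid(partition_id, segment_id, block_id, blocklet_id, offset):
    return (partition_id, segment_id, block_id, blocklet_id, offset)

rowids = [
    make_rowid(1, 0, 2, 0, 55),
    make_rowid(0, 3, 1, 0, 10),
    make_rowid(0, 1, 0, 2, 7),
]
for rid in sorted(rowids):
    print(rid)
# (0, 1, 0, 2, 7)
# (0, 3, 1, 0, 10)
# (1, 0, 2, 0, 55)
```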

Regards
Manish Gupta


Re: [Feature ]Design Document for Update/Delete support in CarbonData

Aniket Adnaik
Hi Manish,

Yes, I agree, we'll have to include the partition ID if we start supporting
partitioning in the future. There might be other options, such as making
the segment ID unique enough to include the partition ID as part of it.
On a side note, we may also need a transaction ID if we start supporting
transaction semantics in the future.

Best Regards,
Aniket


Re: [Feature ]Design Document for Update/Delete support in CarbonData

Liang Chen
In reply to this post by Aniket Adnaik
Hi  Aniket

Thank you for finishing this good design document. A couple of inputs from my side:

1. Please also add the below-mentioned info (the RowID definition, etc.) to the design document.
2. On page 6: "Schema change operation can run in parallel with Update or Delete operations, but not with another schema change operation". Can you explain this item?
3. Please unify the terminology: use "CarbonData" to replace "Carbon", and unify "destination table" and "target table".
4. Is the Update operation's delete delta the same as the Delete operation's delete delta?

BTW, it would be much better if you could provide a Google Doc for review next time; it is really difficult to comment on a PDF document :)

Regards
Liang

Re: [Feature ]Design Document for Update/Delete support in CarbonData

Aniket Adnaik
Hi Liang,

Please see my comments inline.

Best Regards,
Aniket

On Tue, Nov 22, 2016 at 1:00 AM, Liang Chen <[hidden email]> wrote:

> Hi  Aniket
>
> Thanks you finished the good design documents. A couple of inputs from my
> side:
>
> 1.Please add the below mentioned info(Rowid definition etc.) to design
> documents also.
>
AA>> Yes, it is good to have this info in the document.

> 2.In page6 :"Schema change operation can run in parallel with Update or
> Delte operations, but not with another schema change operation" , can you
> explain this item ?
>
AA>> Synchronization for schema change operations, such as a database name
change or a table properties change, is handled separately, allowing update
or delete operations to run in parallel with a schema change operation.

> 3.Please unify the description:  use "CarbonData" to replace "Carbon",
> unify the description for "destination table" and "target table".
>
AA>> Yes, I will update the document accordingly.

> 4.The Update operation's delete delta is same with Delete operation's
> delete
> delta?
>
AA>> Yes, the delete delta is simply the rowids of the qualifying rows
that need to be deleted.

>
> BTW, it would be much better if you could provide google docs for review in
> the next time, it is really difficult to give comment based on pdf
> documents
> :)
AA>> Yes, I agree :). Unfortunately, Google Docs completely mangled the
diagrams when I first tried to save the document into Google Docs; I was
unable to solve that issue, so I uploaded it as a PDF.




Re: [Feature ]Design Document for Update/Delete support in CarbonData

Vimal Das Kammath
In reply to this post by Liang Chen
Hi Aniket,

The design looks sound and the documentation is great.
I have a few suggestions:

1) Measure update vs. dimension update: in the case of a dimension update,
for example, a user wants to change dept1 to dept2 for all users who are
under dept1. Can we just update the dictionary for faster performance?
2) Update semantics (one matching record vs. multiple matching records): I
could not understand this section. I wanted to confirm whether we will
support one update statement updating multiple rows.
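The dictionary-update idea in point 1 can be sketched as follows: because dictionary-encoded dimensions store surrogate keys in the data files, rewriting only the dictionary's key-to-value mapping changes every matching row without touching the data files. This is a deliberately simplified toy, not CarbonData's actual dictionary format:

```python
# Toy illustration of updating a dimension value via its dictionary
# alone (assumed simplification; not CarbonData's dictionary format).
# Data files store surrogate keys; only the mapping is rewritten.

dictionary = {1: "dept1", 2: "dept2", 3: "dept3"}  # surrogate key -> value
column_data = [1, 3, 1, 2, 1]                      # encoded column in data files

def update_dimension(dictionary, old_value, new_value):
    """Rewrite every dictionary entry equal to old_value."""
    for key, value in dictionary.items():
        if value == old_value:
            dictionary[key] = new_value

update_dimension(dictionary, "dept1", "dept2")
decoded = [dictionary[k] for k in column_data]
print(decoded)  # ['dept2', 'dept3', 'dept2', 'dept2', 'dept2']
```

Note that after the change, keys 1 and 2 both decode to "dept2", and this only works for dictionary columns at all, which hints at the complications with append-only dictionary files and no-dictionary columns.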

-Vimal


Re: [Feature ]Design Document for Update/Delete support in CarbonData

manishgupta88
Hi Vimal,

I have a few queries regarding the 1st suggestion.

1. Dimensions can be either dictionary or no-dictionary columns. If we
update the dictionary file, then we will have to maintain two flows: one
for dictionary columns and one for no-dictionary columns. Will that be OK?

2. We write dictionary files in append mode. Updating a dictionary file
would amount to completely rewriting it, which would also modify the
dictionary metadata and sort index files; or is there some other approach
to follow, such as maintaining an update-delta mapping for the dictionary
file?

Regards
Manish Gupta


Re: [Feature ]Design Document for Update/Delete support in CarbonData

Aniket Adnaik
Hi Vimal,

Thanks for your suggestions.
For the 1st point, I tend to agree with Manish's comments, but it's worth
looking into different ways to optimize the performance. I guess query
performance may take priority over update performance; basically, we may
need a better compaction approach to merge delta files into regular
CarbonData files to maintain adequate performance.
For the 2nd point, CarbonData will support updating multiple rows, but not
the same row multiple times in a single update operation. It is possible
that the join condition in the sub-select of the original update statement
results in multiple rows from the source table for the same row in the
target table. This is an ambiguous condition, and the common ways to solve
it are to error out, to apply the first matching row, or to apply the last
matching row. CarbonData will choose to error out and let the user resolve
the ambiguity, which is the safer/standard choice.
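The error-out semantics can be sketched as follows (illustrative Python; all names are assumptions, not CarbonData's API): the update is applied only if no target row is matched by more than one source row.

```python
# Sketch of the "error out on ambiguous update" semantics described
# above. All names here are assumptions, not CarbonData code.
from collections import Counter

def apply_update(matches):
    """matches: list of (target_rowid, new_value) pairs produced by the
    update's join. Raises if any target row is matched more than once."""
    counts = Counter(rowid for rowid, _ in matches)
    duplicates = [rowid for rowid, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(
            f"ambiguous update: target rows matched multiple times: {duplicates}")
    return dict(matches)  # one new value per target row

print(apply_update([("r1", 10), ("r2", 20)]))  # {'r1': 10, 'r2': 20}
try:
    apply_update([("r1", 10), ("r1", 30)])     # same target row twice
except ValueError as e:
    print("rejected:", e)
```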

Best Regards,
Aniket


Re: [Feature ]Design Document for Update/Delete support in CarbonData

kumarvishal09
Hi Aniket,

I agree with Vimal's opinion, but that use case will be rare.

I have one query about this update and delete feature:
when will we start compaction after each update or delete operation?

-Regards
Kumar Vishal



On Thu, Nov 24, 2016 at 12:05 AM, Aniket Adnaik <[hidden email]>
wrote:

> Hi Vimal,
>
> Thanks for your suggestions.
> For the 1st point, I tend to agree with Manish's comments. But it's worth
> looking into different ways to optimize the performance.
> I guess query performance may take priority over update performance.
> Basically, we may need a better compaction approach to merge
> delta files into regular carbon files to maintain adequate performance.
> For the 2nd point, CarbonData will support updating multiple rows, but not
> the same row multiple times in a single update operation. It is possible
> that the join condition in the sub-select of the original update statement
> results in multiple rows from the source table matching the same row in
> the target table. This is an ambiguous condition; common ways to resolve
> it are to error out, apply the first matching row, or apply the last
> matching row. CarbonData will choose to error out and let the user resolve
> the ambiguity, which is the safer/standard choice.
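The "error out on ambiguity" rule described above can be sketched as follows. This is hypothetical Java, not CarbonData's actual code: the class and method names are illustrative. If the source join yields the same target rowid more than once, the update statement fails instead of silently picking one match.

```java
// Illustrative sketch (names are assumptions, not CarbonData's API):
// reject an UPDATE whose source join matches the same target row twice.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AmbiguityCheck {
    static void validate(List<String> targetRowIds) {
        Set<String> seen = new HashSet<>();
        for (String id : targetRowIds) {
            // add() returns false if the rowid was already seen: ambiguity.
            if (!seen.add(id)) {
                throw new IllegalStateException(
                    "UPDATE matches row " + id + " more than once");
            }
        }
    }

    public static void main(String[] args) {
        validate(Arrays.asList("0/1/2/3", "0/1/2/4")); // ok: distinct targets
        try {
            validate(Arrays.asList("0/1/2/3", "0/1/2/3"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```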
>
> Best Regards,
> Aniket
>
> On Wed, Nov 23, 2016 at 4:54 AM, manish gupta <[hidden email]>
> wrote:
>
> > Hi Vimal,
> >
> > I have a few queries regarding the 1st suggestion.
> >
> > 1. Dimensions can be both dictionary and no-dictionary. If we update the
> > dictionary file, then we will have to maintain 2 flows: one for
> > dictionary columns and one for no-dictionary columns. Will that be ok?
> >
> > 2. We write dictionary files in append mode. Updating dictionary files
> > will mean completely rewriting the dictionary file, which will also
> > modify the dictionary metadata and sort index files; OR some other
> > approach needs to be followed, like maintaining an update delta mapping
> > for the dictionary file.
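The append-only property mentioned above can be sketched as follows. This is illustrative Java, not CarbonData's actual dictionary classes: each distinct value gets a monotonically increasing surrogate key, which is why changing a value in place would invalidate keys already written into data files and force a full rewrite of the dictionary and its sort index.

```java
// Illustrative sketch (class and method names are assumptions): an
// append-only dictionary assigning monotonically increasing surrogate keys.
import java.util.LinkedHashMap;
import java.util.Map;

public class AppendOnlyDictionary {
    // Insertion order is preserved, mirroring the append-mode file layout.
    private final Map<String, Integer> keyOf = new LinkedHashMap<>();

    // Existing values keep their key; new values are appended with the
    // next key. Keys are never reassigned, so data files stay valid.
    int getOrAssign(String value) {
        return keyOf.computeIfAbsent(value, v -> keyOf.size() + 1);
    }

    public static void main(String[] args) {
        AppendOnlyDictionary dict = new AppendOnlyDictionary();
        System.out.println(dict.getOrAssign("dept1")); // 1
        System.out.println(dict.getOrAssign("dept2")); // 2
        System.out.println(dict.getOrAssign("dept1")); // 1 (not re-appended)
    }
}
```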
> >
> > Regards
> > Manish Gupta
> >
> > On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath <
> > [hidden email]> wrote:
> >
> > > Hi Aniket,
> > >
> > > The design looks sound and the documentation is great.
> > > I have a few suggestions.
> > >
> > > 1) Measure update vs dimension update: in the case of a dimension
> > > update, for example, a user wants to change dept1 to dept2 for all
> > > users who are under dept1. Can we just update the dictionary for
> > > faster performance?
> > > 2) Update semantics (one matching record vs multiple matching
> > > records): I could not understand this section. I wanted to confirm
> > > whether one update statement will support updating multiple rows.
> > >
> > > -Vimal
> > >
> > > On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen <[hidden email]>
> > > wrote:
> > >
> > > > Hi Aniket,
> > > >
> > > > Thanks for finishing this good design document. A couple of inputs
> > > > from my side:
> > > >
> > > > 1. Please add the below-mentioned info (RowID definition etc.) to
> > > > the design document as well.
> > > > 2. On page 6, regarding "Schema change operation can run in parallel
> > > > with Update or Delete operations, but not with another schema change
> > > > operation": can you explain this item?
> > > > 3. Please unify the description: use "CarbonData" to replace
> > > > "Carbon", and unify the description of "destination table" and
> > > > "target table".
> > > > 4. Is the Update operation's delete delta the same as the Delete
> > > > operation's delete delta?
> > > >
> > > > BTW, it would be much better if you could provide a Google Doc for
> > > > review next time; it is really difficult to comment on PDF documents
> > > > :)
> > > >
> > > > Regards
> > > > Liang
> > > >
kumar vishal

Re: [Feature ]Design Document for Update/Delete support in CarbonData

kumarvishal09
Hi Aniket,

I think if an update/delete touches little data, then horizontal compaction
can run based on user configuration; but if more data is getting updated,
it is better to start vertical compaction immediately. This is because we
are not physically deleting the data from disk: if more data is getting
updated (more than 60%), then during a query we will first read the older
data, exclude the deleted records, and include the update delta file data.
In that case more data will come into memory, which we can avoid by
starting vertical compaction immediately after the update/delete.
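A minimal sketch of such a threshold-driven choice follows. This is illustrative Java, not CarbonData's actual compaction code: the class name and exact thresholds are assumptions, with 25% taken from the earlier automatic-compaction discussion and 60% from this message.

```java
// Illustrative decision rule (thresholds and names are assumptions): pick
// a compaction strategy from the fraction of rows carried in delta files.
public class CompactionPolicy {
    enum Action { NONE, HORIZONTAL, VERTICAL }

    // deltaRows: rows resolved through update/delete deltas;
    // totalRows: total rows scanned.
    static Action choose(long deltaRows, long totalRows) {
        double ratio = totalRows == 0 ? 0.0 : (double) deltaRows / totalRows;
        if (ratio > 0.60) return Action.VERTICAL;   // rewrite base files now
        if (ratio > 0.25) return Action.HORIZONTAL; // merge small delta files
        return Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(choose(70, 100)); // VERTICAL
        System.out.println(choose(30, 100)); // HORIZONTAL
        System.out.println(choose(5, 100));  // NONE
    }
}
```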

-Regards
Kumar Vishal

kumar vishal

Re: [Feature ]Design Document for Update/Delete support in CarbonData

Aniket Adnaik
Hi Kumar Vishal,

Yes, valid point, and there have been thoughts about it; there is a lot of
scope for optimizing compaction strategies. We may even consider a
background monitor process (or a cron job or similar) to monitor and
trigger compaction automatically in the future.

Best Regards,
Aniket


Re: [Feature ]Design Document for Update/Delete support in CarbonData

Jacky Li
Hi Aniket,

Yes, a background monitor process is preferred in the future. And there are
other places that already need this process, like refreshing the caches in
the driver and executors. Currently, dictionary caches and index caches are
refreshed by checking timestamps on every query, which introduces
unnecessary overhead in the query flow and impacts the NameNode in
concurrent query scenarios.
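The per-query timestamp check described above can be sketched roughly as follows. This is illustrative Java, not CarbonData's actual cache code: in practice the timestamp would come from a file-status call, which is exactly the per-query NameNode hit being criticized; here it is passed in to keep the sketch self-contained.

```java
// Illustrative sketch (names are assumptions): a cache entry that is
// revalidated against a file timestamp on EVERY lookup, i.e. every query.
import java.util.function.Supplier;

public class TimestampCheckedCache {
    private String value;
    private long cachedTs;

    // fileTs would come from a filesystem metadata call in practice
    // (the NameNode round-trip on HDFS); loader reloads the dictionary
    // or index content when the file is newer than the cached copy.
    String get(long fileTs, Supplier<String> loader) {
        if (value == null || fileTs > cachedTs) {
            value = loader.get();
            cachedTs = fileTs;
        }
        return value;
    }

    public static void main(String[] args) {
        TimestampCheckedCache c = new TimestampCheckedCache();
        System.out.println(c.get(100, () -> "dict-v1")); // dict-v1 (loaded)
        System.out.println(c.get(100, () -> "dict-v2")); // dict-v1 (cached)
        System.out.println(c.get(200, () -> "dict-v2")); // dict-v2 (reloaded)
    }
}
```
A background monitor would invert this design: the cache is invalidated by an event or periodic scan instead of a metadata check on the query path.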

Regards,
Jacky