Time travel/versioning on carbondata.

Time travel/versioning on carbondata.

ravipesala
Hi All,

CarbonData allows data to be stored incrementally and supports Update/Delete
operations on the stored data. However, the user can only ever access the
latest state of the data at any given point in time.

In the current system, it is not possible to access an old version of the
data, nor is it possible to roll back to an old version if there is a
problem with the current version of the data.

This proposal adds automatic versioning of the data we store, so that any
historical version of that data can be accessed.


The design is attached to the JIRA issue
https://issues.apache.org/jira/browse/CARBONDATA-3500 Please check it.

--
Thanks & Regards,
Ravindra.

Re: Time travel/versioning on carbondata.

chetdb

Hi Ravindra,

1. The table status file for each transaction can be named based on its timestamp.
2. How is the state of the data (store) maintained if the user frequently reverts the data to different transaction points? Will query operations take more time?
3. If clean files has to wait for the transaction cleanup before removing the compacted segments, the user can be advised to configure a smaller transaction retention value.
4. If a large number of segments are already present, and the IUD, alter table and delete segment operations are required to open and close transactions, won't this impact the performance of these operations?
5. Suppose a user created a carbon store with an older carbon version and decides to copy the store or directly upgrade to the latest carbon version. Would this transaction feature be supported for the store or data written by the older carbon version?
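
As a rough illustration of point 1, here is a minimal Python sketch of how timestamp-named table status files could be resolved for an "as of" query. The `tablestatus_<yyyyMMddHHmmss>` naming and the `status_file_as_of` helper are assumptions made up for this example; they are not taken from the design document or from CarbonData's actual file layout.

```python
import bisect

# Hypothetical naming scheme: tablestatus_<yyyyMMddHHmmss>.
# Fixed-width timestamps sort correctly as plain strings.
def status_file_as_of(file_names, query_ts):
    """Return the newest status file whose timestamp is <= query_ts, or None."""
    stamps = sorted(name.rsplit("_", 1)[1] for name in file_names)
    idx = bisect.bisect_right(stamps, query_ts)
    if idx == 0:
        return None  # the query time is older than every recorded transaction
    return "tablestatus_" + stamps[idx - 1]

files = [
    "tablestatus_20190823120000",
    "tablestatus_20190823123730",
    "tablestatus_20190824090000",
]
print(status_file_as_of(files, "20190823180000"))  # tablestatus_20190823123730
```

With such a scheme the lookup is a binary search over the file names, so no separate index is needed to map a timestamp to a version.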

Regards
Chetan
On 2019/08/23 12:37:30, Ravindra Pesala <[hidden email]> wrote:

> Hi All,
>
> CarbonData allows data to be stored incrementally and supports Update/Delete
> operations on the stored data. However, the user can only ever access the
> latest state of the data at any given point in time.
>
> In the current system, it is not possible to access an old version of the
> data, nor is it possible to roll back to an old version if there is a
> problem with the current version of the data.
>
> This proposal adds automatic versioning of the data we store, so that any
> historical version of that data can be accessed.
>
>
> The design is attached to the JIRA issue
> https://issues.apache.org/jira/browse/CARBONDATA-3500 Please check it.
>
> --
> Thanks & Regards,
> Ravindra.
>

Re: Time travel/versioning on carbondata.

kunalkapoor
In reply to this post by ravipesala
Hi Ravindra,
I have some questions regarding the feature:

1. *What would be the behaviour if the user just fires a 'select * from
table' (non-transaction query)?*
    Would we still read the transaction file to get the latest tablestatus
file name, or would we keep the latest transaction in cache?
    My concern is that this may impact the query performance as the
transaction file grows.

2. *Would the user be able to create child datamaps like 'preaggregate',
'mv', 'bloom' with some transaction id?*
     ex: create datamap dm1 using 'mv' as select * from maintable where
'somefilter' @YYYYmmDDHHmmSS
     *I think this scenario can be blocked.*

3. *Would the user be able to clean the retention files using 'clean files',
or would some new DDL be exposed for transaction cleanup?*

4. *Impacted areas should include the index server, as the transaction details
have to be sent to the server for pruning.*

5. *Impacted Area Point 8: 'Alter table operations should open and close
the transaction'*
    Does this mean that for each alter operation a transaction entry would
be maintained, and the user can query the old schema by specifying the
transaction id before that operation? If yes, then would multiple
versions of schema files be maintained?

6. *Can the user travel both ways with the revert command?*
     For example, first reset to an old transaction id and then come back to
the latest id?
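
To make the caching option in question 1 concrete, here is a minimal Python sketch. Everything in it is hypothetical: `read_log` stands in for parsing the transaction file, and `log_version` for some cheap change marker such as the file's modification time. The design document may handle this differently.

```python
class LatestTransactionCache:
    """Caches the latest table status file name and re-reads the
    transaction log only when its change marker moves."""

    def __init__(self, read_log, log_version):
        self._read_log = read_log        # hypothetical: returns [(txn_id, status_file), ...]
        self._log_version = log_version  # hypothetical: cheap change marker, e.g. file mtime
        self._seen_version = None
        self._latest = None
        self.reads = 0                   # counts full reads of the log, for illustration

    def latest_status_file(self):
        version = self._log_version()
        if version != self._seen_version:
            entries = self._read_log()
            self.reads += 1
            # The entry with the highest transaction id is the latest one.
            self._latest = max(entries, key=lambda e: e[0])[1] if entries else None
            self._seen_version = version
        return self._latest

log = [(1, "tablestatus_1"), (2, "tablestatus_2")]
cache = LatestTransactionCache(lambda: list(log), lambda: len(log))
cache.latest_status_file()
cache.latest_status_file()
print(cache.reads)  # 1
```

In this sketch a plain 'select * from table' pays for a full read of the transaction file only after a new transaction has been committed, so the cost of repeated queries does not grow with the size of the file.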


+1 for not removing the compacted segments immediately to maintain
transaction history.

Thanks
Kunal Kapoor



Re: Time travel/versioning on carbondata.

akashnilugal@gmail.com
In reply to this post by ravipesala
Hi Ravindra,

I have some doubts and suggestions:

1. Since you suggest keeping the compacted segments as they are for compaction, will the same also apply to the delete segment by id or date operations?

2. There is a proposal to move the delete delta file data into the segment file, and we have one segment file per segment. If we run multiple delete operations on a segment, is a separate segment file generated for each transaction, or is the transaction id of each delete operation maintained in the same segment file?

So,

   (i). If there are multiple segment files, one per delete operation transaction, will the store structure change?

   (ii). I suggest keeping the transaction id of each delete operation as a key in the segment file, with multiple values, which are basically the multiple delete delta values. This can help reduce IO on the segment file.

3. Select * from table1@tyyyyMMDDHHmmSS will give the data as of that transaction point, right? And will there be any support for getting the data between two transactions? Basically it would be like getting the delta data between two transactions. Maybe there is no use case for it, just a doubt.

4. Currently we have MAX_QUERY_EXECUTION_TIME, which cleans the segments after the timeout. Since we are deciding to keep the compacted segments as they are without cleaning, maybe we need to eliminate this property and introduce a transaction timeout for cleanup, or else the existing default may need to be increased.
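
To illustrate suggestion 2(ii) and the question in point 3, here is a minimal Python sketch of a segment file's delete-delta section keyed by transaction id. The dictionary layout, the sample ids and the two helper functions are hypothetical and only meant to show the idea, not the actual segment file format.

```python
# Hypothetical in-memory picture of one segment file's delete-delta section:
# transaction id -> delete delta entries written by that transaction.
delete_deltas = {
    101: ["delta_a", "delta_b"],
    105: ["delta_c"],
    109: ["delta_d"],
}

def deltas_as_of(deltas_by_txn, txn_id):
    """Delete deltas from all transactions up to and including txn_id."""
    return [d for t in sorted(deltas_by_txn) if t <= txn_id
            for d in deltas_by_txn[t]]

def deltas_between(deltas_by_txn, start_txn, end_txn):
    """Delete deltas added after start_txn, up to and including end_txn."""
    return [d for t in sorted(deltas_by_txn) if start_txn < t <= end_txn
            for d in deltas_by_txn[t]]

print(deltas_as_of(delete_deltas, 105))         # ['delta_a', 'delta_b', 'delta_c']
print(deltas_between(delete_deltas, 101, 109))  # ['delta_c', 'delta_d']
```

With this layout a single read of the segment file is enough to answer both an "as of" query and a "between two transactions" query, which is where the IO saving would come from.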


Regards,
Akash R Nilugal
