DDL for CarbonData table backup and recovery (new feature)

DDL for CarbonData table backup and recovery (new feature)

mohdshahidkhan
Hi Dev,

*Please find the initial solution.*


*CarbonData table backup and recovery*

*Background*

A customer has created a CarbonData table into which a very large amount of
data has already been loaded. They are now installing another cluster that
should use the same data without loading it again, because loading takes a
long time. So they want to back up this table's data directly and recover it
in the other cluster. After recovery, the user can work with the data as a
normal CarbonData table.

*Requirement Description*

A CarbonData table should support backing up its data and recovering it
without having to load the data again.

To reuse the CarbonData table of another cluster, a DDL command should be
provided to create the CarbonData table from the existing carbon table schema.

*Solution*

Currently CarbonData has the below two types of tables:

1.   Normal table

2.   Pre-aggregate table

CarbonData should provide a DDL command to create a table from existing
table data. The below DDL command could be used:

*  REGISTER TABLES FROM $dbPath*



           i.   The database path will be scanned to get all the table
schemas.

           ii.  Each schema will be read to get the database name, table name,
and column details.

           iii. The table will be registered in the Hive catalog with the
below details:

CREATE TABLE $tbName USING carbondata OPTIONS (tableName "$dbName.$tbName",

dbName "$dbName",

tablePath "$tablePath",

path "$tablePath" )
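The scan-and-register flow above could look roughly like the following Python sketch. This is illustrative only: the real implementation would read CarbonData's thrift schema files through the Spark/Scala APIs, and the directory layout (`$dbPath/<tableName>/Metadata/schema`) is an assumption here, not a confirmed store layout.

```python
import os

def find_table_schemas(db_path):
    """Scan a database directory for CarbonData table schema files.

    Assumes each table lives in <db_path>/<tableName>/Metadata/schema
    (an illustrative layout, not the guaranteed carbon store format).
    """
    schemas = []
    for table_name in sorted(os.listdir(db_path)):
        schema_file = os.path.join(db_path, table_name, "Metadata", "schema")
        if os.path.isfile(schema_file):
            schemas.append((table_name, schema_file))
    return schemas

def build_register_ddl(db_name, table_name, table_path):
    """Build the CREATE TABLE statement used to register one table."""
    return (
        f'CREATE TABLE {table_name} USING carbondata OPTIONS ('
        f'tableName "{db_name}.{table_name}", '
        f'dbName "{db_name}", '
        f'tablePath "{table_path}", '
        f'path "{table_path}")'
    )
```

Each schema found by the scan would feed one generated CREATE TABLE statement.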


*Precondition:*

i.        Before executing this command, the old table schema and data
should be copied into the new store location.

ii.      If the table has aggregate tables, then all the aggregate tables
should also be copied to the new store location.



*Validation:*


   1.   If the database does not exist, the registration will fail.
   2.   The table will be registered only if a table with the same name is
   not already registered.
   3.   If the table has aggregate tables, then all of them should be
   registered in the Hive catalog; if any aggregate table does not exist,
   the table creation operation should fail.
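The three validation rules above can be sketched as a simple pre-check. The `catalog` dict below is a toy stand-in for the Hive metastore, and the presence flags for aggregate tables are hypothetical; this is not CarbonData's actual validation code.

```python
def validate_registration(catalog, db_name, table_name, agg_tables):
    """Apply the three validation rules before registering a table.

    catalog: toy metastore model, {db_name: set_of_registered_tables}.
    agg_tables: {agg_table_name: bool} flag for whether its schema/data
    is present in the store. Returns (ok, reason).
    """
    # Rule 1: the database must already exist.
    if db_name not in catalog:
        return False, f"database {db_name} does not exist"
    # Rule 2: the table must not already be registered.
    if table_name in catalog[db_name]:
        return False, f"table {table_name} is already registered"
    # Rule 3: every aggregate table must be present, else fail the whole
    # registration.
    missing = [t for t, present in agg_tables.items() if not present]
    if missing:
        return False, f"missing aggregate tables: {missing}"
    return True, "ok"
```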

Regards,

Shahid

Re: DDL for CarbonData table backup and recovery (new feature)

Naresh P R
Hi Shahid,

Can the new DDL be similar to the Import/Export syntax?
e.g.,
EXPORT TABLE tablename TO 'export_target_path' -- exports the actual table &
associated agg tables as a zip file


IMPORT [TABLE tablename] FROM 'source_path' -- imports data from the zip
file to the "carbon store path" & registers the table as mentioned in your
mail; tablename can be optional in this case.

==> If tablename is not mentioned, or the mentioned table does not exist,
we can assume the table does not exist & needs to be created.

==> If tablename is mentioned & the table exists, then we can treat it as
an incremental data update or schema evolution.

    ==> We can validate the existing files' checksums against the new files &
overwrite/remove stale files.

    ==> If a schema update happened, then we can update the schema in the
metastore the same way as we do for the add/drop column commands.
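The checksum comparison suggested above could be sketched like this. It is an illustration of the idea using MD5 over file contents and plain name-to-checksum maps, not CarbonData's actual import logic.

```python
import hashlib

def file_md5(path):
    """Compute the MD5 checksum of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def classify_files(existing, incoming):
    """Compare {name: checksum} maps of existing vs incoming files.

    Returns (unchanged, changed, new, stale) file-name lists, so the
    caller can overwrite changed files and remove stale ones.
    """
    unchanged = [n for n in existing if incoming.get(n) == existing[n]]
    changed = [n for n in existing if n in incoming and incoming[n] != existing[n]]
    new = [n for n in incoming if n not in existing]
    stale = [n for n in existing if n not in incoming]
    return unchanged, changed, new, stale
```

`changed` files would be overwritten and `stale` files removed, matching the overwrite/remove step described above.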


I think all newer CarbonData versions are backward compatible; any
restrictions or thoughts on cross-version import/export?
---
Regards,
Naresh P R


On Thu, Nov 23, 2017 at 4:47 PM, Mohammad Shahid Khan <
[hidden email]> wrote:


Re: DDL for CarbonData table backup and recovery (new feature)

mohdshahidkhan
Hi Naresh,
Hive EXPORT exports the metadata as well as the table data.
We do not want to export the table data, as that would be tedious for TBs of
data.
We already have the table and its data in the store location, but the table
is not registered with the Hive metastore.

Regards,
Shahid



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: DDL for CarbonData table backup and recovery (new feature)

Naresh P R
Thanks for the clarification, Shahid. Much appreciated.

Actually, if the export command is on a CarbonData table, we can just zip the
actual table folder & associated agg table folders into the user-mentioned
location; it doesn't export metadata.
Copying data from one cluster to the other will still remain the same in your
approach as well.

After copying data into the new cluster, how do we synchronize incremental
loads or schema evolution from the old cluster to the new cluster?
Should we drop the table in the new cluster, copy the data from the old
cluster to the new cluster & recreate the table again?

I think creating a CarbonData table also requires the schema information to
be passed:
CREATE TABLE $dbName.$tbName (${ fields.map(f => f.rawSchema).mkString(",") })
USING CARBONDATA OPTIONS (tableName "$tbName", dbName "$dbName",
tablePath "$tablePath")
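The point that the column schema must be passed can be illustrated with a small helper that assembles the statement from per-field raw schemas, mirroring the `fields.map(f => f.rawSchema).mkString(",")` expression in the Scala snippet above. The field strings here are hypothetical examples.

```python
def create_table_ddl(db_name, tb_name, fields, table_path):
    """Assemble a CREATE TABLE statement including the column schema.

    fields: list of raw column definitions such as "id int", joined with
    commas exactly like mkString(",") in the Scala snippet.
    """
    cols = ",".join(fields)
    return (
        f"CREATE TABLE {db_name}.{tb_name} ({cols}) "
        f'USING CARBONDATA OPTIONS (tableName "{tb_name}", '
        f'dbName "{db_name}", tablePath "{table_path}")'
    )
```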
---
Regards,
Naresh P R


On Fri, Nov 24, 2017 at 10:02 AM, mohdshahidkhan <
[hidden email]> wrote:


Re: DDL for CarbonData table backup and recovery (new feature)

mohdshahidkhan
In reply to this post by mohdshahidkhan
*Please find the updated solution:
Instead of passing the dbLocation, the database name will be passed in the
REGISTER DDL.*

  REGISTER TABLES FOR DATABASE $DBName;

           i.  The database path will be retrieved from the Hive catalog and
               scanned to get all the tables.
          ii.  Each table schema will be read to get the column details.
         iii.  The table will be registered in the Hive catalog with the
below details:
CREATE TABLE $tbName USING carbondata OPTIONS (tableName "$dbName.$tbName",
dbName "$dbName",
tablePath "$tablePath",
path "$tablePath" )

Precondition:
    i. The user has to create the database, and before executing this command
       the old table schema and data should be copied into the database
       location.
   ii. If the table has aggregate tables, then all the aggregate tables
       should also be copied into the database location.

Validation:
   1. If the database does not exist, the registration will fail.
   2. The table will be registered only if a table with the same name is not
      already registered.
   3. If the table has aggregate tables, then all of them should be
      registered in the Hive catalog; if any aggregate table does not exist,
      the table creation operation should fail.




Re: DDL for CarbonData table backup and recovery (new feature)

mohdshahidkhan
In reply to this post by Naresh P R
Thanks for the clarification, Naresh.
Please find my answers below.

> Actually, if the export command is on a CarbonData table, we can just zip
> the actual table folder & associated agg table folders into the
> user-mentioned location; it doesn't export metadata. Copying data from one
> cluster to the other will still remain the same in your approach as well.
A. Agreed, we don't want to export the data. The user simply has the tables
from the previous cluster and wants to use them, so to use them they must be
registered with Hive.

> After copying data into the new cluster, how do we synchronize incremental
> loads or schema evolution from the old cluster to the new cluster? Should
> we drop the table in the new cluster, copy the data from the old cluster
> to the new cluster & recreate the table again?
A. Syncing from the old cluster to the new one is not in scope.

> I think creating a CarbonData table also requires the schema information
> to be passed:
> CREATE TABLE $dbName.$tbName (${ fields.map(f => f.rawSchema).mkString(",") })
> USING CARBONDATA OPTIONS (tableName "$tbName", dbName "$dbName",
> tablePath "$tablePath")
A. Agreed, will take the same approach.





Re: DDL for CarbonData table backup and recovery (new feature)

mohdshahidkhan
In reply to this post by mohdshahidkhan
Hi Ravindra & Likun,
I am freezing the design and going to start coding.
Please revert if there are any issues.

--Regards,
   Shahid

On Fri, Nov 24, 2017 at 12:40 PM, mohdshahidkhan <
[hidden email]> wrote:


Re: DDL for CarbonData table backup and recovery (new feature)

mohdshahidkhan

Hi Dev,
Table-level registration should also be supported.
-- Register carbon tables at the table level:

      *REGISTER TABLE $tbName;*

Use case:
If a user has 10 tables but wants to register only 2 or 3 of them, not all.
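A minimal sketch of how the table-level variant could reuse the database-level scan with a filter; the function and its arguments are purely illustrative, not proposed API.

```python
def tables_to_register(all_tables, requested=None):
    """Choose which tables to register.

    With no `requested` list (database-level DDL) every table found in the
    store is taken; with one (table-level DDL) only the named tables are,
    and naming a table that is not in the store is an error.
    """
    if requested is None:
        return list(all_tables)
    missing = [t for t in requested if t not in all_tables]
    if missing:
        raise ValueError(f"tables not found in store: {missing}")
    return list(requested)
```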

 
Regards,
Shahid


