Hi Dev,
Please find the design document for Support Database Location Configuration while Creating Database.

Regards,
Shahid

(Attachment: Support Database Location.docx, 22K)
Hi,
What carbon provides are two levels of concepts:

1. File format, which can be used by a compute engine to write and read data. CarbonData is a self-describing, type-aware columnar file format for the Hadoop environment, just as ORC and Parquet provide.

2. Table-level storage, which includes not just the file format but also the aggregated index file (datamap), global dictionary, and segment metadata. It provides more functionality for segment management and SQL optimization (like lazy decode) through deep integration with the compute engine (currently only Spark deep integration is supported).

In my opinion, these two levels of abstraction are the core of the CarbonData project. But the database concept should belong to the compute engine, which manages the store-level metadata, since Spark, Hive, and Presto all have this part in their layer.

I think what CarbonData is currently missing is that, for table-level storage, the user should be able to specify the table location where the table data is saved. This is to achieve the following (point 1 is sketched in code below):

1. The compute engine can manage the carbon table location the same way as ORC and Parquet tables, and the user uses the same API or SQL syntax to create a CarbonData table, like `df.format("carbondata").save("path")` using the Spark DataFrame API. There should be no carbon storePath involved.

2. The user should be able to save a table in an HDFS location or an S3 location in the same context. Since several carbon properties are involved when determining the FS type (the LOCK file, etc.), it is not possible to create tables on HDFS and on S3 in the same context, which also breaks the table-level abstraction.

Regards,
Jacky

> On 3 Oct 2017, at 10:36 PM, Mohammad Shahid Khan <[hidden email]> wrote:
>
> Hi Dev,
> Please find the design document for Support Database Location Configuration while Creating Database.
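To make the DataFrame-API parity in point 1 concrete, here is a minimal sketch assuming standard Spark 2.x APIs. The explicit-path `carbondata` save shown last is the proposed behaviour being discussed, not something supported at the time of this thread:

```scala
import org.apache.spark.sql.SparkSession

object SaveParity {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("save-parity").getOrCreate()
    val df = spark.range(10).toDF("id")

    // Parquet today: the caller picks the path; no engine-global store path.
    df.write.format("parquet").save("hdfs://ns1/warehouse/t_parquet")

    // The proposal: carbondata accepts an explicit path the same way,
    // with no carbon storePath property involved.
    df.write.format("carbondata").save("hdfs://ns1/warehouse/t_carbon")
  }
}
```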
In the case of Spark, where to write the table content is decided in two ways.
1. If a table is created under a database for which the location attribute is not configured by the end user, then the content of the table is written under "spark.sql.warehouse.dir". For example, if spark.sql.warehouse.dir = /opt/hive/warehouse/, then the table content is written at /opt/hive/warehouse/.

2. But if the location attribute is set by the user while creating the database, then the content of any table created under that database is written to the configured location. For example, if for database x the user sets location '/user/custom/warehouse', then a table created under database x is written at '/user/custom/warehouse'.

(Both cases are sketched in code at the end of this message.)

Currently for carbon we have customized the store writing and always write to the fixed path, i.e. 'spark.sql.warehouse.dir'. This proposal is to address the same issue.

Q 1. The compute engine can manage the carbon table location the same way as ORC and Parquet tables, and the user uses the same API or SQL syntax to create a CarbonData table, like `df.format("carbondata").save("path")` using the Spark DataFrame API. There should be no carbon storePath involved.

A. I think this is a different requirement; it does not consider the database and table at all. This is about writing the table content at a desired location in a specified format.

Q 2. The user should be able to save a table in an HDFS location or an S3 location in the same context. Since several carbon properties are involved when determining the FS type (the LOCK file, etc.), it is not possible to create tables on HDFS and on S3 in the same context, which also breaks the table-level abstraction.

A. This requirement is for the viewfs file system, where different databases can lie in different nameservices.
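A minimal sketch of the two Spark behaviours described above, as one would run them in the spark-shell (database names and paths are illustrative; this is plain Spark SQL, not carbon-specific):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Case 1: no LOCATION attribute -- tables land under spark.sql.warehouse.dir.
spark.sql("CREATE DATABASE db1")
spark.sql("CREATE TABLE db1.t1 (id INT)")
// content written under /opt/hive/warehouse/db1.db/t1
// (assuming spark.sql.warehouse.dir = /opt/hive/warehouse)

// Case 2: LOCATION attribute set -- tables land under the configured path.
spark.sql("CREATE DATABASE db2 LOCATION '/user/custom/warehouse'")
spark.sql("CREATE TABLE db2.t1 (id INT)")
// content written under /user/custom/warehouse/t1
```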
Thank you for the clarification.
On 6 Oct 2017 05:08, "Jacky Li" <[hidden email]> wrote:

> I mean, whether or not spark.sql.warehouse.dir is set by the user, the carbon core should not be aware of the database, and the related path-construction logic should be kept in Spark or the spark-integration module only. We should achieve that, inside carbon, it only knows that the upper layer has specified a table location to write the table data.
>
> All database concepts and commands should be managed by the upper layer. This is not conflicting with your requirement.
>
> Regards,
> Jacky
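A hypothetical sketch of the layering Jacky describes — the names below are illustrative, not actual CarbonData APIs. The spark-integration module resolves the database location and hands carbon core nothing but a concrete table path:

```scala
// Integration layer (spark / spark-integration module): knows about
// databases and resolves the final table path.
def resolveTablePath(databaseLocation: String, tableName: String): String =
  s"$databaseLocation/$tableName"

// Carbon core (hypothetical entry point): receives only the resolved path;
// it never sees database names or warehouse configuration.
def writeTable(tablePath: String): Unit = {
  // ... write segments, index files and metadata under tablePath ...
}

writeTable(resolveTablePath("/user/custom/warehouse", "t1"))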
Hi, Khan:
I have some questions about your design:

1. It looks like the following command is already supported by Spark (Hive):

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT 'database_comment'] [LOCATION hdfs_path];

2. For Spark and Hive, the default tablePath is `tablePath = databaseLocation + "/" + tableName`; I think keeping this default behavior is better. For backward compatibility, we can also support tablePath = carbon.storeLocation + "/" + database_Name + "/" + tableName.

3. What does `Carbon.update.sync.folder` mean?
Hi Sea,
1. CREATE DATABASE with a location is supported by Spark (Hive) only; carbon will not have any implementation of its own for CREATE DATABASE. It is mentioned here just for reference regarding the location attribute.

2. Why does carbon want to keep tablePath = databaseLocation + "/" + database_Name + "/" + tableName?

There is a problem if we keep the tablePath the same as Hive. For CarbonFileMetaStore, carbon creates the schema file at <TablePath>/Metadata/schema. If carbon skips adding the databaseName, then two tables having the same name, from two different databases pointing to the same database location, will cause problems during table creation, load, and query. Even in Hive, if two tables with the same name are created in different databases pointing to the same location, then when either table is queried, the content from both tables is shown.

3. What does `Carbon.update.sync.folder` mean?

This is to configure the directory for modifiedTime.mdt. Earlier, the directory path for modifiedTime.mdt was fixed to carbon.storeLocation, but what if the user decides to remove the nameservice of the carbon.storeLocation? This is required for a federation cluster, where multiple nameservices are available. So if the nameservice hosting the directory for modifiedTime.mdt is removed, the directory can be changed.

Regards,
Shahid
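To illustrate the collision in point 2, a small sketch assuming two databases created with the same location and a Hive-style table path (paths and names are examples, not real code):

```scala
// Hive-style path: tablePath = databaseLocation + "/" + tableName.
def hiveStylePath(databaseLocation: String, tableName: String): String =
  s"$databaseLocation/$tableName"

// Two databases pointing at the same location, each with a table "t1":
val a = hiveStylePath("/shared/loc", "t1") // db_a.t1 -> /shared/loc/t1
val b = hiveStylePath("/shared/loc", "t1") // db_b.t1 -> /shared/loc/t1
// Both would create /shared/loc/t1/Metadata/schema -- a collision for
// CarbonFileMetaStore during create, load and query.

// Including the database name keeps the paths distinct:
def carbonPath(location: String, dbName: String, tableName: String): String =
  s"$location/$dbName/$tableName"
// db_a.t1 -> /shared/loc/db_a/t1 ; db_b.t1 -> /shared/loc/db_b/t1
```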
Hi, Shahid:
I think you misunderstood my meaning: the databaseLocation you mentioned is like carbon.storeLocation, not the databaseLocation in Hive. Your databaseLocation is like `hive.metastore.warehouse.dir`.

The default behavior in Spark (Hive) is: if we do not specify a database location,

databaseLocation = hive.metastore.warehouse.dir (or spark.sql.warehouse.dir) + "/" + databaseName.db

so databaseLocation is unique. And if we do not specify a table location:

tablePath = databaseLocation + "/" + tableName
Hi Dev,

Changes: carbon will follow the same approach as Hive. The table path will be formed from the database location (or the fixed carbon store location) and the table name, as given below. There are three possible scenarios:

I. Table path for databases defined with a location attribute:
tablePath = databaseLocation + "/" + tableName

II. Table path for databases defined without a location attribute:
tablePath = carbon.storeLocation + "/" + database_Name + ".db" + "/" + tableName

III. New table path for the default database:
tablePath = carbon.storeLocation + "/" + tableName

(The three rules are sketched in code below.)

Regards,
Shahid
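A sketch of the three rules, using hypothetical names for the inputs (this is not actual CarbonData code):

```scala
// Hypothetical helper implementing rules I-III above.
def tablePath(storeLocation: String,              // carbon.storeLocation
              databaseName: String,
              databaseLocation: Option[String],   // LOCATION attribute, if any
              tableName: String): String =
  databaseLocation match {
    case Some(loc)                         => s"$loc/$tableName"                              // I
    case None if databaseName == "default" => s"$storeLocation/$tableName"                    // III
    case None                              => s"$storeLocation/${databaseName}.db/$tableName" // II
  }

tablePath("/carbon/store", "x", Some("/user/custom/warehouse"), "t1")
// -> /user/custom/warehouse/t1   (rule I)
tablePath("/carbon/store", "x", None, "t1")
// -> /carbon/store/x.db/t1       (rule II)
tablePath("/carbon/store", "default", None, "t1")
// -> /carbon/store/t1            (rule III)
```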
(Attachment: Support Database Location_V2.docx, 25K)