Partition features of Spark:

1. Creating a table with partitions
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

2. Loading data
Static partition

LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
  INTO TABLE partitioned_user
  PARTITION (country = 'US', state = 'CA')

INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country = 'US', state = 'AL')
  SELECT * FROM another_user au
  WHERE au.country = 'US' AND au.state = 'AL';

Dynamic partition

LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
  INTO TABLE partitioned_user
  PARTITION (country, state)

INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country, state)
  SELECT * FROM another_user;

3. Dropping and showing partitions
SHOW PARTITIONS [db_name.]table_name
ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)

4. Updating partitions
ALTER TABLE table_name PARTITION part_spec RENAME TO PARTITION part_spec

Currently, carbon supports partitions through its own custom implementation, so if community users want to use the partition features available in spark and hive with carbondata, a compatibility problem arises. Carbondata also has no built-in dynamic partitioning. To use the partition feature of spark we should comply with the interfaces spark provides for loading and reading data.

Approach 1:
Comply with the pure spark datasource API and implement the standard interfaces for reading and writing data at the file level. Carbondata can be implemented the same way parquet and ORC are implemented in spark. To support this we need to implement the FileFormat interface for reading and writing data at the file level, not the table level: CarbonFileInputFormat for reading (reads data at the file level) and CarbonOutputFormat for writing (writes data per partition).
Pros:
1. It is the cleanest interface to use with spark; all spark features work without any impact.
2. Upgrading to new versions of spark is straightforward.
Cons:
1. Carbondata features such as IUD, compaction and alter table, and data management commands like show segments, delete segments etc., cannot work.

Approach 2:
Improve and expand the in-house partition feature that already exists in carbondata. Add the missing features such as dynamic partitioning and comply with the standard syntax for loading data into partitions.
Pros:
1. All current features of carbondata work without much impact.
Cons:
1. The current partition implementation does not comply with spark partitioning, so a lot of effort is needed to implement this.

Approach 3:
A hybrid of the 1st approach. Basically, write the data using the FileFormat and CarbonOutputFormat interfaces, so all the partition information is added to hive automatically since we are creating a datasource table. We make sure that the current folder structure does not change while writing the data. Here we maintain a mapping file inside the segment folder that maps each partition to its carbonindex files. While reading, we first get the partition information from hive and do the pruning, and based on the pruned partitions we read the partition mapping file to get the carbonindex files for querying. With this approach we do not support the current carbondata partition feature, but we support the spark partition features.
Pros:
1. Supports the standard interface for loading data, so features like partitioning and bucketing are supported automatically.
2. All standard SQL syntax works fine with this approach.
3. All current features of carbon also work fine.
Cons:
1. The existing partition feature cannot work.
2. Minor impact on features like compaction, IUD and clean files because of maintaining the partition mapping file.

--
Thanks & Regards,
Ravindra
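To illustrate what Approach 3 would enable, the standard spark syntax above would apply to a carbon table directly. A minimal sketch, assuming carbondata is registered as the datasource provider name and using made-up table and column names:

CREATE TABLE IF NOT EXISTS partitioned_user (
  id INT,
  name STRING,
  country STRING,
  state STRING
)
USING carbondata
PARTITIONED BY (country, state);

-- dynamic partition load: partition values are taken from the query output
INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country, state)
  SELECT id, name, country, state FROM another_user;

-- partition metadata is registered in the hive metastore,
-- so standard commands and pruning work
SHOW PARTITIONS partitioned_user;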
Hi,

I prefer approach 3. If we use approach 3, the hive and presto integrations can also do partition pruning for carbon, right?
Regards,
Jacky
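To make the pruning concrete: with a table partitioned on country and state as in the proposal above, a filter on the partition columns only needs to read the matching partition, for example:

-- only the country=US/state=CA partition is scanned; all other partitions
-- are pruned using metastore metadata, without reading their data files
SELECT name
FROM partitioned_user
WHERE country = 'US' AND state = 'CA';

Under Approach 3 the same pruning would be available to any engine (hive, presto) that reads the partition information from the metastore.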
In reply to this post by ravipesala
The datasource API still has a problem: it does not support hybrid fileformat tables. A detailed description of hybrid fileformat tables is in this issue: https://issues.apache.org/jira/browse/CARBONDATA-1377.

All partitions of a datasource table must use the same fileformat, so we can't change the fileformat to carbondata with the command "alter table table_xxx set fileformat carbondata;".

So I think implementing a TableReader is the right way.
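For context, the "hybrid fileformat" case refers to hive's ability to set the storage format per partition, so old partitions can stay in their original format while new ones use a different one. A rough sketch in hive DDL, assuming a hypothetical table table_xxx partitioned by dt and assuming carbondata input/output formats are registered with hive:

-- change the table-level format; in hive this affects partitions added afterwards
ALTER TABLE table_xxx SET FILEFORMAT carbondata;

-- existing partitions keep (or can explicitly be set to) their original format
ALTER TABLE table_xxx PARTITION (dt = '2017-11-01') SET FILEFORMAT ORC;

A spark datasource table records a single format for the whole table, which is why this per-partition mixing is not possible there.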
Hi,

Please find the design document for standard partition support in carbon:
https://docs.google.com/document/d/1NJo_Qq4eovl7YRuT9O7yWTL0P378HnC8WT0-6pkQ7GQ/edit?usp=sharing

Regards,
Ravindra.

Attachment: Standard Partitioning Support in CarbonData.docx (13K)
Hi, Ravindra:
I read your design document. Why not use the standard hive/spark folder structure? Is there any problem with using the hive/spark folder structure?

Best regards!
Yuhai Cen
Hi Jacky,
Here the main problem is the underlying segment-based design of carbon. For every incremental load carbon creates a segment and manages the segments through the tablestatus file. The changes would be very big and the impact high if we try to change this design. We would also have a backward compatibility problem when the folder structure changes for new loads.

Regards,
Ravindra.
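For readers following the thread, the segment-based layout being referred to is roughly the following (file names are illustrative):

<store_path>/<database>/<table>
|--Metadata
|---schema
|---tablestatus
|--Fact
|---Part0
|----Segment_0
|-----xxx.carbonindex
|-----part-0-xxx.carbondata

Each incremental load adds a new Segment_N folder, and the tablestatus file records the state of every segment (success, marked for delete, compacted), which is what commands like show segments and delete segment operate on.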
I still insist that if we want to make carbon a general fileformat in the hadoop ecosystem, we should support the standard hive/spark folder structure.

We can use a folder structure like this:

TABLE_PATH
Customer=US
|--Segment_0
|---0-12212.carbonindex
|---PART-00-12212.carbondata
|---0-34343.carbonindex
|---PART-00-34343.carbondata

or

TABLE_PATH
Customer=US
|--Part0
|--Fact
|--Segment_0
|---0-12212.carbonindex
|---PART-00-12212.carbondata
|---0-34343.carbonindex
|---PART-00-34343.carbondata

I know there will be some impact on compaction and segment management.
@Jacky @Ravindra @chenliang @David CaiQiang, can you estimate the impact?

Best regards!
Yuhai Cen
Hi Yuhai Cen,
Yes, you are right: to generalize the fileformat we should support the standard folder structure like hive. But we have a lot of other features built on the current folder structure, so removing it would have a big impact on those features.

Right now we are implementing CarbonTableOutputFormat, which manages table segments while loading and writes data in the current carbon folder structure. In addition there will be CarbonOutputFormat and CarbonInputFormat, which just write and read data at the file level and are totally managed by spark/hive; these will be the generalized fileformat interfaces for integrating with systems like hive and presto.

Regards,
Ravindra.
In reply to this post by cenyuhai11
Hi Yuhai Cen,
As Ravindra said, I think we will eventually need two OutputFormats:

1. CarbonTableOutputFormat
This is needed to maintain the segment structure of carbondata and to keep all segment-related commands working for a partitioned table, such as Show Segments, Delete Segment, etc.

2. CarbonFileOutputFormat
This will write carbondata files directly into the partition folder without the segment folder, and the segment-related commands may not work in this case. This OutputFormat is an incremental effort on top of the CarbonTableOutputFormat work.

So for now we are focusing on implementing CarbonTableOutputFormat; once it is done, CarbonFileOutputFormat can be added later.

Regards,
Jacky
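For reference, the segment-related commands in question are CarbonData's SQL extensions; a brief sketch with a hypothetical table name, following the documented syntax:

-- list all loads (segments) of the table with their status
SHOW SEGMENTS FOR TABLE partitioned_user;

-- drop specific loads by segment id
DELETE FROM TABLE partitioned_user WHERE SEGMENT.ID IN (0, 1);

-- physically remove data of deleted/compacted segments
CLEAN FILES FOR TABLE partitioned_user;

These keep working with CarbonTableOutputFormat because every load still produces a segment tracked in the tablestatus file.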
ok
Best regards!
Yuhai Cen