Partition features of Spark:

1. Creating a table with partitions
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

2. Loading data
Static partition

LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
  INTO TABLE partitioned_user
  PARTITION (country = 'US', state = 'CA')

INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country = 'US', state = 'AL')
  SELECT * FROM another_user au
  WHERE au.country = 'US' AND au.state = 'AL';

Dynamic partition

LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
  INTO TABLE partitioned_user
  PARTITION (country, state)

INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country, state)
  SELECT * FROM another_user;

3. Dropping and showing partitions
SHOW PARTITIONS [db_name.]table_name
ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)

4. Updating partitions
ALTER TABLE table_name PARTITION part_spec RENAME TO PARTITION part_spec

Currently, carbon supports partitions through its own custom implementation, so if community users want to use the partition features available in spark and hive with carbondata, a compatibility problem arises. Carbondata also has no built-in dynamic partitioning. To use the partition feature of spark we should comply with the interfaces spark provides for loading and reading data.

Approach 1:
Comply with the pure spark datasource API and implement the standard interfaces for reading and writing data at the file level. Carbondata can be implemented the same way parquet and ORC are implemented in spark. To support this we need to implement the FileFormat interface for reading and writing data at the file level, not the table level: CarbonFileInputFormat for reading (reads data at the file level) and CarbonOutputFormat for writing (writes data per partition).
Pros:
1. It is the cleanest interface to use with spark; all spark features work without any impact.
2. Upgrading to new versions of spark is straightforward.
Cons:
1. Carbondata features such as IUD, compaction and alter table, and data management commands like show segments, delete segments etc., cannot work.

Approach 2:
Improve and expand the in-house partition feature that already exists in carbondata. Add the missing features such as dynamic partitioning and comply with the standard syntax for loading data into partitions.
Pros:
1. All current features of carbondata work without much impact.
Cons:
1. The current partition implementation does not comply with spark partitioning, so a lot of effort is needed to implement this.

Approach 3:
A hybrid of the 1st approach. Basically, write the data using the FileFormat and CarbonOutputFormat interfaces, so all the partition information is added to hive automatically since we are creating a datasource table. We make sure that the current folder structure does not change while writing the data. Here we maintain a mapping file inside the segment folder that maps each partition to its carbonindex files. While reading, we first get the partition information from hive and do the pruning, and based on the pruned partitions we read the partition mapping file to get the carbonindex files for querying. With this approach we do not support the current carbondata partition feature, but we support the spark partition features.
Pros:
1. Supports the standard interface for loading data, so features like partitioning and bucketing are supported automatically.
2. All standard SQL syntax works fine with this approach.
3. All current features of carbon also work fine.
Cons:
1. The existing partition feature cannot work.
2. Minor impact on features like compaction, IUD and clean files because of maintaining the partition mapping file.

--
Thanks & Regards,
Ravindra
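To illustrate what Approach 3 would enable, the standard spark syntax above would apply to a carbon table directly. A minimal sketch, assuming carbondata is registered as the datasource provider name and using made-up table and column names:

CREATE TABLE IF NOT EXISTS partitioned_user (
  id INT,
  name STRING,
  country STRING,
  state STRING
)
USING carbondata
PARTITIONED BY (country, state);

-- dynamic partition load: partition values are taken from the query output
INSERT OVERWRITE TABLE partitioned_user
  PARTITION (country, state)
  SELECT id, name, country, state FROM another_user;

-- partition metadata is registered in the hive metastore,
-- so standard commands and pruning work
SHOW PARTITIONS partitioned_user;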
Hi,

I prefer approach 3. If we use approach 3, the hive and presto integrations can also do partition pruning for carbon, right?
Regards,
Jacky
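To make the pruning concrete: with a table partitioned on country and state as in the proposal above, a filter on the partition columns only needs to read the matching partition, for example:

-- only the country=US/state=CA partition is scanned; all other partitions
-- are pruned using metastore metadata, without reading their data files
SELECT name
FROM partitioned_user
WHERE country = 'US' AND state = 'CA';

Under Approach 3 the same pruning would be available to any engine (hive, presto) that reads the partition information from the metastore.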
In reply to this post by ravipesala
The datasource API still has a problem: it does not support hybrid fileformat tables. A detailed description of hybrid fileformat tables is in this issue: https://issues.apache.org/jira/browse/CARBONDATA-1377.

All partitions of a datasource table must use the same fileformat, so we can't change the fileformat to carbondata with the command "alter table table_xxx set fileformat carbondata;".

So I think implementing a TableReader is the right way.
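For context, the "hybrid fileformat" case refers to hive's ability to set the storage format per partition, so old partitions can stay in their original format while new ones use a different one. A rough sketch in hive DDL, assuming a hypothetical table table_xxx partitioned by dt and assuming carbondata input/output formats are registered with hive:

-- change the table-level format; in hive this affects partitions added afterwards
ALTER TABLE table_xxx SET FILEFORMAT carbondata;

-- existing partitions keep (or can explicitly be set to) their original format
ALTER TABLE table_xxx PARTITION (dt = '2017-11-01') SET FILEFORMAT ORC;

A spark datasource table records a single format for the whole table, which is why this per-partition mixing is not possible there.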
Hi,

Please find the design document for standard partition support in carbon:
https://docs.google.com/document/d/1NJo_Qq4eovl7YRuT9O7yWTL0P378HnC8WT0-6pkQ7GQ/edit?usp=sharing

Regards,
Ravindra.

Attachment: Standard Partitioning Support in CarbonData.docx (13K)
Hi, Ravindra:
I read your design document. Why not use the standard hive/spark folder structure? Is there any problem with using the hive/spark folder structure?

Best regards!
Yuhai Cen
Hi Jacky,
Here the main problem is the underlying segment-based design of carbon. For every incremental load carbon creates a segment and manages the segments through the tablestatus file. The changes would be very big and the impact high if we try to change this design. We would also have a backward compatibility problem when the folder structure changes for new loads.

Regards,
Ravindra.
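For readers following the thread, the segment-based layout being referred to is roughly the following (file names are illustrative):

<store_path>/<database>/<table>
|--Metadata
|---schema
|---tablestatus
|--Fact
|---Part0
|----Segment_0
|-----xxx.carbonindex
|-----part-0-xxx.carbondata

Each incremental load adds a new Segment_N folder, and the tablestatus file records the state of every segment (success, marked for delete, compacted), which is what commands like show segments and delete segment operate on.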
I still insist that if we want to make carbon a general fileformat in the hadoop ecosystem, we should support the standard hive/spark folder structure.

We can use a folder structure like this:

TABLE_PATH
Customer=US
|--Segment_0
|---0-12212.carbonindex
|---PART-00-12212.carbondata
|---0-34343.carbonindex
|---PART-00-34343.carbondata

or

TABLE_PATH
Customer=US
|--Part0
|--Fact
|--Segment_0
|---0-12212.carbonindex
|---PART-00-12212.carbondata
|---0-34343.carbonindex
|---PART-00-34343.carbondata

I know there will be some impact on compaction and segment management.
@Jacky @Ravindra @chenliang @David CaiQiang, can you estimate the impact?

Best regards!
Yuhai Cen
Hi Yuhai Cen,
Yes, you are right: to generalize the fileformat we should support the standard folder structure like hive. But we have a lot of other features built on the current folder structure, so removing it would have a big impact on those features.

Right now we are implementing CarbonTableOutputFormat, which manages table segments while loading and writes data in the current carbon folder structure. In addition there will be CarbonOutputFormat and CarbonInputFormat, which just write and read data at the file level and are totally managed by spark/hive; these will be the generalized fileformat interfaces for integrating with systems like hive and presto.

Regards,
Ravindra.
In reply to this post by cenyuhai11
Hi Yuhai Cen,
As Ravindra said, I think we will eventually need two OutputFormats:

1. CarbonTableOutputFormat
This is needed to maintain the segment structure of carbondata and to keep all segment-related commands working for a partitioned table, such as Show Segments, Delete Segment, etc.

2. CarbonFileOutputFormat
This will write carbondata files directly into the partition folder without the segment folder, and the segment-related commands may not work in this case. This OutputFormat is an incremental effort on top of the CarbonTableOutputFormat work.

So for now we are focusing on implementing CarbonTableOutputFormat; once it is done, CarbonFileOutputFormat can be added later.

Regards,
Jacky
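For reference, the segment-related commands in question are CarbonData's SQL extensions; a brief sketch with a hypothetical table name, following the documented syntax:

-- list all loads (segments) of the table with their status
SHOW SEGMENTS FOR TABLE partitioned_user;

-- drop specific loads by segment id
DELETE FROM TABLE partitioned_user WHERE SEGMENT.ID IN (0, 1);

-- physically remove data of deleted/compacted segments
CLEAN FILES FOR TABLE partitioned_user;

These keep working with CarbonTableOutputFormat because every load still produces a segment tracked in the tablestatus file.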
ok
Best regards!
Yuhai Cen