http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Support-Spark-Hive-based-partition-in-carbon-tp27594p28352.html
Hi, I prefer approach 3. If we use approach 3, the Hive and Presto integrations can also do partition pruning for carbon, right?
> On 21 Nov 2017, at 10:56 PM, Ravindra Pesala <
[hidden email]> wrote:
>
> Partition features of Spark:
>
> 1. Creating table with partition
> CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
> [(col_name1 col_type1 [COMMENT col_comment1], ...)]
> USING datasource
> [OPTIONS (key1=val1, key2=val2, ...)]
> [PARTITIONED BY (col_name1, col_name2, ...)]
> [TBLPROPERTIES (key1=val1, key2=val2, ...)]
> [AS select_statement]
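For example, a concrete instance of the syntax above (table and column names are illustrative) could look like:

```sql
CREATE TABLE IF NOT EXISTS mydb.partitioned_user (
  name STRING,
  age INT,
  country STRING,
  state STRING
)
USING carbondata
PARTITIONED BY (country, state)
```

Note that in the Spark datasource syntax the partition columns are part of the table schema, and PARTITIONED BY lists only their names.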
>
> 2. Load data
> Static Partition
>
> LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
> INTO TABLE partitioned_user
> PARTITION (country = 'US', state = 'CA')
>
> INSERT OVERWRITE TABLE partitioned_user
> PARTITION (country = 'US', state = 'AL')
> SELECT * FROM another_user au
> WHERE au.country = 'US' AND au.state = 'AL';
>
> Dynamic Partition
>
> LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
> INTO TABLE partitioned_user
> PARTITION (country, state)
>
> INSERT OVERWRITE TABLE partitioned_user
> PARTITION (country, state)
> SELECT * FROM another_user;
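The difference between static and dynamic partitioning above is where the partition values come from: in the dynamic case they are taken from each row rather than from the statement. A minimal Python sketch of that routing (the function and structure are illustrative, not CarbonData's actual writer):

```python
from collections import defaultdict

def route_rows(rows, partition_cols):
    """Group rows into Hive-style partition directories (col=value/...)
    derived from each row's own values, as dynamic partitioning does."""
    partitions = defaultdict(list)
    for row in rows:
        path = "/".join(f"{col}={row[col]}" for col in partition_cols)
        partitions[path].append(row)
    return dict(partitions)

rows = [
    {"name": "a", "country": "US", "state": "CA"},
    {"name": "b", "country": "US", "state": "AL"},
]
print(sorted(route_rows(rows, ["country", "state"])))
# ['country=US/state=AL', 'country=US/state=CA']
```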
>
> 3. Drop, show partitions
> SHOW PARTITIONS [db_name.]table_name
> ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)
>
> 4. Updating the partitions
> ALTER TABLE table_name PARTITION part_spec RENAME TO PARTITION part_spec
>
>
> Currently, carbon supports only partitions that are custom implemented by
> carbon. So if community users want to use the partition features available
> in Spark and Hive with carbondata, a compatibility problem arises. Also,
> carbondata does not have built-in dynamic partitioning.
> To use the partition feature of Spark we should comply with the interfaces
> available in Spark while loading and reading the data.
>
> Approach 1:
> Comply with the pure Spark datasource API and implement the standard
> interfaces for reading and writing data at the file level, just like how
> Parquet and ORC are implemented in Spark. To support it we need to
> implement a FileFormat interface for reading and writing the data at the
> file level, not the table level. For reading, we should implement
> CarbonFileInputFormat (reads data at the file level), and for writing,
> CarbonOutputFormat (writes data per partition).
> Pros:
> 1. It is a clean interface to use on Spark; all features of Spark work
> without any impact.
> 2. Upgrading to new versions of Spark is straightforward and simple.
> Cons:
> Carbondata features such as IUD, compaction, alter table, and data
> management commands like show segments, delete segments, etc. cannot work.
>
> Approach 2:
> Improve and expand the in-house partition feature that already exists in
> carbondata. Add the missing features, like dynamic partitioning, and
> comply with the standard syntax for loading data into partitions.
> Pros:
> All current features of carbondata work without much impact.
> Cons:
> The current partition implementation does not comply with Spark
> partitioning, so a lot of effort is needed to implement it.
>
> Approach 3:
> It is a hybrid of approach 1. Basically, write the data using the
> FileFormat and CarbonOutputFormat interfaces, so all the partition
> information is added to Hive automatically since we are creating a
> datasource table. We make sure that the current folder structure does not
> change while writing the data. Here we maintain a mapping file inside each
> segment folder that maps each partition to its carbonindex files. While
> reading, we first get the partition information from Hive, do the pruning,
> and, based on the pruned partitions, read the partition mapping file to
> get the carbonindex files for querying.
> Here we will not support the current carbondata partition feature, but we
> do support the Spark partition features.
> Pros:
> 1. Supports the standard interface for loading data, so features like
> partitioning and bucketing are automatically supported.
> 2. All standard SQL syntax works fine with this approach.
> 3. All current features of carbon also work fine.
> Cons:
> 1. The existing partition feature cannot work.
> 2. Minor impact on features like compaction, IUD, and clean files because
> of maintaining the partition mapping file.
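The read path of approach 3 (prune partitions via the Hive metastore, then resolve pruned partitions to carbonindex files through the per-segment mapping file) can be sketched as below. The mapping-file layout (partition path to list of carbonindex files) is an assumption for illustration; the actual format is still to be decided.

```python
def prune_index_files(mapping, pruned_partitions):
    """Return the carbonindex files for the partitions that survived
    pruning against the Hive metastore.

    mapping: dict of partition path -> list of carbonindex file names,
             as read from the (hypothetical) mapping file in a segment
             folder.
    pruned_partitions: partition paths kept after Hive-side pruning.
    """
    files = []
    for partition in pruned_partitions:
        files.extend(mapping.get(partition, []))
    return files

# Example mapping-file content for one segment folder (illustrative).
mapping = {
    "country=US/state=CA": ["part-0-0.carbonindex"],
    "country=US/state=AL": ["part-0-1.carbonindex"],
}

# Suppose pruning on country='US' AND state='CA' kept one partition:
print(prune_index_files(mapping, ["country=US/state=CA"]))
# ['part-0-0.carbonindex']
```

Only the surviving carbonindex files are then used for the query, so non-matching partitions are never touched.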
>
> --
> Thanks & Regards,
> Ravindra