[Discussion] Implement Partition Table Feature

[Discussion] Implement Partition Table Feature

lionel061201
Hi Dev,
I've drafted a document about implementing the partition table feature. Please help
review it and give your advice.

https://github.com/lionelcao/CarbonData_Docs/blob/master/partition.md

Thanks,
Cao Lu

Re: [Discussion] Implement Partition Table Feature

Jacky Li
Hi Cao Lu,

The overall design looks good to me; I just have the following points to confirm:
1. Is there a delete partition DDL?
2. For the data loading part, does it need to do a global shuffle before the actual data loading? And the partition key should not be included in the SORT_COLUMNS option, right? If so, I think it is better to put this constraint in the document as well.
3. For the query part, I suggest adding more description of the index, such as how the B-tree will be loaded into the driver and how many B-trees there will be.
4. As a further optimization, is it possible to map the partition to a DataNode so that we do not need to communicate with the NameNode for every query? Can this mapping be treated as a cache?

Regards,
Jacky

Re: [Discussion] Implement Partition Table Feature

Jacky Li

> On April 15, 2017, at 12:00 PM, Jacky Li <[hidden email]> wrote:
>
> Hi Cao Lu,
>
> The overall design looks good to me; I just have the following points to
> confirm:
> 1. Is there a delete partition DDL?
> 2. For the data loading part, does it need to do a global shuffle before the
> actual data loading? And the partition key should not be included in the
> SORT_COLUMNS option, right? If so, I think it is better to put this
> constraint in the document as well.

On second thought, I think it is up to the user whether to put the partition key in SORT_COLUMNS. There should be no constraint.

> 3. For the query part, I suggest adding more description of the index, such as
> how the B-tree will be loaded into the driver and how many B-trees there will be.
> 4. As a further optimization, is it possible to map the partition to a
> DataNode so that we do not need to communicate with the NameNode for every
> query? Can this mapping be treated as a cache?
>
> Regards,
> Jacky
>
>
> --
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-Implement-Partition-Table-Feature-tp10938p11063.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.




Re: [Discussion] Implement Partition Table Feature

David CaiQiang
In reply to this post by lionel061201
Hi Cao Lu,
  I suggest mentioning the following information.

1. Table creation
Modify schema.thrift to add optional partitioner information to TableSchema.

2. Alter table add/drop partition

3. Data loading of a partition table
Use the partitioner information in TableSchema to generate the table partitioner, then use this partitioner to repartition the input RDD, and finally reuse the loadDataFrame flow.

Use the partition id to replace the task no in the carbondata/index file names, so there is no need to store partition information in the footer and index files.

4. Detail query on a partition table with a partition-column filter
Use the partition-column filter to get the partition id list, then use that list to filter the BTree.

5. Partition tables join on the partition column
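To make step 3 concrete, here is a minimal sketch of how a hash partitioner could map a partition-key value to a partition id before the input rows are repartitioned. The class and method names are hypothetical, for illustration only; they are not CarbonData's actual Partitioner implementation.

```java
import java.util.Arrays;
import java.util.List;

public class HashPartitionerSketch {
    private final int numPartitions;

    public HashPartitionerSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Map a partition-key value to a partition id, as the loader would
    // before repartitioning the input rows by this id.
    public int getPartitionId(Object key) {
        // Mask the sign bit instead of Math.abs, which overflows for
        // Integer.MIN_VALUE hash codes.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        HashPartitionerSketch p = new HashPartitionerSketch(4);
        List<String> keys = Arrays.asList("2017-04-15", "2017-04-16", "2017-04-17");
        for (String k : keys) {
            System.out.println(k + " -> partition " + p.getPartitionId(k));
        }
    }
}
```

Because the mapping is deterministic, a query with an equality filter on the partition key can later recompute the same id to prune partitions (step 4).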
Best Regards
David Cai

Re: [Discussion] Implement Partition Table Feature

David CaiQiang
Sub-task list for the Partition Table Feature:

1. Define the PartitionInfo model
Modify schema.thrift to define PartitionInfo, and add PartitionInfo to TableSchema.

2. Create table with partition
CarbonSparkSqlParser parses the partition part to generate PartitionInfo and adds PartitionInfo to TableModel.

CreateTable adds PartitionInfo to TableInfo and stores PartitionInfo in TableSchema.

3. Data loading of a partition table
Use PartitionInfo to generate the Partitioner (hash, list, range).
Use the Partitioner to repartition the input data file, and reuse the loadDataFrame flow.
Use the partition id to replace the task no in the carbondata/index file names.

4. Detail filter query on the partition column
Support the equal filter to get the partition id, and use this partition id to filter the BTree.
In the future, other filters (range, in, ...) will be supported.

5. Partition tables join on the partition column

6. Alter table add/drop partition
Any suggestions?

Best Regards,
David QiangCai

Re: [Discussion] Implement Partition Table Feature

lionel061201
1. Carbon uses different SQL parsers for Spark 1.6 and 2.1, so CarbonSQLParser
needs to be changed for 1.6.
2. For interval range partitioning, no fixed partition name is defined in the DDL,
but we need to keep the partition names in the schema and update them when a new
partition is added.
3. There is one BTree per partition per segment on the driver side.
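Point 3 above amounts to the driver holding one index structure per (partition, segment) pair. A minimal sketch of such a keyed cache is below; the BTreeIndex placeholder and class names are hypothetical, not CarbonData's actual driver-side index classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class DriverIndexCache {
    // Placeholder for a per-partition, per-segment BTree index that maps
    // block start keys to block paths.
    static class BTreeIndex {
        final TreeMap<Long, String> startKeyToBlock = new TreeMap<>();
    }

    private final Map<String, BTreeIndex> cache = new HashMap<>();

    private static String key(int partitionId, String segmentId) {
        return partitionId + "/" + segmentId;
    }

    // One BTree per (partition, segment); built lazily on first access.
    public BTreeIndex getOrLoad(int partitionId, String segmentId) {
        return cache.computeIfAbsent(key(partitionId, segmentId), k -> new BTreeIndex());
    }

    public int size() {
        return cache.size();
    }
}
```

With this layout, partition pruning simply skips every cache entry whose partition id does not match the filter, so only the relevant BTrees are traversed.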

On Mon, Apr 17, 2017 at 3:29 PM, QiangCai <[hidden email]> wrote:

> Sub-task list for the Partition Table Feature:
>
> 1. Define the PartitionInfo model
> Modify schema.thrift to define PartitionInfo, and add PartitionInfo to
> TableSchema.
>
> 2. Create table with partition
> CarbonSparkSqlParser parses the partition part to generate PartitionInfo and
> adds PartitionInfo to TableModel.
>
> CreateTable adds PartitionInfo to TableInfo and stores PartitionInfo in
> TableSchema.
>
> 3. Data loading of a partition table
> Use PartitionInfo to generate the Partitioner (hash, list, range).
> Use the Partitioner to repartition the input data file, and reuse the
> loadDataFrame flow.
> Use the partition id to replace the task no in the carbondata/index file names.
>
> 4. Detail filter query on the partition column
> Support the equal filter to get the partition id, and use this partition id to
> filter the BTree.
> In the future, other filters (range, in, ...) will be supported.
>
> 5. Partition tables join on the partition column
>
> 6. Alter table add/drop partition
>
> Any suggestions?
>
> Best Regards,
> David QiangCai
>

Re: [Discussion] Implement Partition Table Feature

Liang Chen
Administrator
In reply to this post by Jacky Li
Hi,

Agreed, we don't need to add a special constraint against a column appearing in both the partition key and SORT_COLUMNS.

But based on actual cases, we do not suggest making the partition key the same as the first column of SORT_COLUMNS; maybe we need to add this tip to the partition feature's document.

Regards
Liang

Jacky Li wrote
> On April 15, 2017, at 12:00 PM, Jacky Li <[hidden email]> wrote:
>
> Hi Cao Lu,
>
> The overall design looks good to me; I just have the following points to
> confirm:
> 1. Is there a delete partition DDL?
> 2. For the data loading part, does it need to do a global shuffle before the
> actual data loading? And the partition key should not be included in the
> SORT_COLUMNS option, right? If so, I think it is better to put this
> constraint in the document as well.

On second thought, I think it is up to the user whether to put the partition key in SORT_COLUMNS. There should be no constraint.

> 3. For the query part, I suggest adding more description of the index, such as
> how the B-tree will be loaded into the driver and how many B-trees there will be.
> 4. As a further optimization, is it possible to map the partition to a
> DataNode so that we do not need to communicate with the NameNode for every
> query? Can this mapping be treated as a cache?
>
> Regards,
> Jacky