Hi Dev,
I've drafted a doc about implementing the partition table feature; please help review it and give your advice. https://github.com/lionelcao/CarbonData_Docs/blob/master/partition.md

Thanks,
Cao Lu
Hi Cao Lu,
The overall design looks good to me; I just have the following points to confirm:
1. Is there a delete partition DDL?
2. For the data loading part, does it need to do a global shuffle before the actual data loading? And the partition key should not be included in the SORT_COLUMNS option, right? If yes, I think it is better to put this constraint in the document as well.
3. For the query part, I suggest adding more description of the index, such as how the B-tree will be loaded into the driver and how many B-trees there will be.
4. As a further optimization, is it possible to map the partitions to DataNodes so that we do not need to communicate with the NameNode for every query? Can this mapping be treated as a cache? (A sketch of this idea follows below.)
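A hedged sketch of the cache idea in point 4, assuming hypothetical names throughout (PartitionLocationCache is not an existing CarbonData class):

```scala
import scala.collection.concurrent.TrieMap

object PartitionLocationCache {
  // partitionId -> preferred DataNode hosts, filled on the first NameNode lookup.
  private val locations = TrieMap[Int, Seq[String]]()

  // Only the first query for a partition pays the NameNode round-trip;
  // later queries schedule tasks from the cached hosts directly.
  def hostsFor(partitionId: Int, lookup: Int => Seq[String]): Seq[String] =
    locations.getOrElseUpdate(partitionId, lookup(partitionId))

  // Invalidate when blocks move, e.g. after compaction or HDFS rebalancing.
  def invalidate(partitionId: Int): Unit = {
    locations.remove(partitionId)
  }
}
```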
Regards,
Jacky

> On 15 Apr 2017, at 12:00 PM, Jacky Li <[hidden email]> wrote:
>
> 2. For the data loading part, does it need to do a global shuffle before the actual data loading? And the partition key should not be included in the SORT_COLUMNS option, right? If yes, I think it is better to put this constraint in the document as well.

After a second thought, I think it is up to the user whether to put the partition key in SORT_COLUMNS. There should be no constraint.
Hi Cao Lu,
I suggest mentioning the following information:
1. Table creation: modify schema.thrift to add optional partitioner information to TableSchema.
2. Alter table add/drop partition.
3. Data loading of a partition table: use the partitioner information from TableSchema to generate the table partitioner, use this partitioner to repartition the input RDD, and finally reuse the loadDataFrame flow. Use the partition id to replace the task no in the carbondata/index file name, so there is no need to store partition information in the footer and index file (see the sketch below).
4. Detail query on a partition table with a partition column filter: use the partition column filter to get the partition id list, then use the partition id list to filter the BTree.
5. Partition tables join on the partition column.
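A minimal sketch of the loading flow in point 3, assuming a hypothetical HashPartitionerImpl derived from the partitioner information in TableSchema; all names here are illustrative, not actual CarbonData classes:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner built from TableSchema's partitioner information.
class HashPartitionerImpl(val numPartitions: Int) extends Partitioner {
  // Route a row to a partition by hashing its partition key (kept non-negative).
  override def getPartition(key: Any): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions
}

// input is keyed by the partition column value; the global shuffle happens in
// partitionBy, after which the existing loadDataFrame flow runs per partition.
def loadPartitionTable(input: RDD[(String, Array[String])], numPartitions: Int): Unit = {
  val repartitioned = input.partitionBy(new HashPartitionerImpl(numPartitions))
  repartitioned.foreachPartition { rows =>
    // The partition id replaces the task no in the carbondata/index file name,
    // so the footer and index file need not store partition information.
    // writeCarbonFiles(rows)  -- placeholder for the reused loadDataFrame flow
    ()
  }
}
```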
Best Regards
David Cai
Sub-task list of the Partition Table feature:
1. Define the PartitionInfo model: modify schema.thrift to define PartitionInfo, and add PartitionInfo to TableSchema.
2. Create table with partition: CarbonSparkSqlParser parses the partition clause to generate PartitionInfo and adds it to TableModel; CreateTable adds PartitionInfo to TableInfo and stores it in TableSchema.
3. Data loading of a partition table: use PartitionInfo to generate a Partitioner (hash, list, range); use the Partitioner to repartition the input data file; reuse the loadDataFrame flow; use the partition id to replace the task no in the carbondata/index file name.
4. Detail filter query on the partition column: support the equal filter to get a partition id, and use this partition id to filter the BTree. In the future, other filters (range, in, ...) will be supported.
5. Partition tables join on the partition column.
6. Alter table add/drop partition.

Any suggestions? A sketch of the PartitionInfo model and partitioner selection follows below.
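A minimal sketch of sub-tasks 1 and 3, reusing the HashPartitionerImpl from the loading sketch above; the field names here are assumptions for illustration, not the final schema.thrift definition:

```scala
// Enumerates the three partitioner kinds named in sub-task 3.
object PartitionType extends Enumeration {
  val HASH, LIST, RANGE = Value
}

// Mirrors the optional partitioner information proposed for TableSchema.
case class PartitionInfo(
    columnName: String,               // the partition column
    partitionType: PartitionType.Value,
    numPartitions: Int,               // used by HASH
    listValues: Seq[Seq[String]],     // used by LIST: each inner list is one partition
    rangeBounds: Seq[String])         // used by RANGE: sorted upper bounds

// During data loading, PartitionInfo drives which Spark Partitioner is built.
def createPartitioner(info: PartitionInfo): org.apache.spark.Partitioner =
  info.partitionType match {
    case PartitionType.HASH  => new HashPartitionerImpl(info.numPartitions)
    case PartitionType.LIST  => sys.error("list partitioner: map each value set to an id")
    case PartitionType.RANGE => sys.error("range partitioner: binary-search the bounds")
  }
```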
Best Regards
David Cai
1. Carbon uses different SQL parsers in Spark 1.6 and 2.1, so CarbonSQLParser needs to be changed for 1.6.
2. For interval range partitions, no fixed partition name is defined in the DDL, but the partition names need to be kept in the schema and updated when a new partition is added.
3. There is one BTree for one partition and one segment on the driver side (see the sketch below).
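A minimal sketch of the driver-side layout implied by point 3, keyed by (segment, partition); BTreeIndex and DriverIndexCache are hypothetical names, not actual CarbonData classes:

```scala
import scala.collection.mutable

// Placeholder for the per-partition BTree loaded into the driver.
trait BTreeIndex

object DriverIndexCache {
  // One BTree for one partition and one segment.
  private val cache = mutable.Map[(String, Int), BTreeIndex]()

  def put(segmentId: String, partitionId: Int, index: BTreeIndex): Unit =
    cache((segmentId, partitionId)) = index

  // An equal filter on the partition column yields a partition id list, which
  // prunes whole BTrees before any block-level filtering runs.
  def prune(segmentIds: Seq[String], partitionIds: Seq[Int]): Seq[BTreeIndex] =
    for {
      s   <- segmentIds
      p   <- partitionIds
      idx <- cache.get((s, p))
    } yield idx
}
```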
Hi
Agreed, we don't need to add a special constraint against one column appearing in both the partition key and SORT_COLUMNS. But based on actual cases, we don't suggest making the partition key the same as the first column of SORT_COLUMNS; maybe we should add this tip to the partition feature's document. For example, see the sketch below.
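A hedged illustration of this tip; the exact CarbonData partition DDL is still being designed in this thread, so the syntax below (PARTITIONED BY plus PARTITION_TYPE/NUM_PARTITIONS table properties) is an assumption:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sort-tip").getOrCreate()

// event_date is the partition key, so rows are already clustered by it;
// leading SORT_COLUMNS with a different filter column (city) adds more value
// than repeating event_date as the first sort column.
spark.sql(
  """CREATE TABLE sales (order_id INT, city STRING)
    |PARTITIONED BY (event_date DATE)
    |STORED BY 'carbondata'
    |TBLPROPERTIES (
    |  'PARTITION_TYPE' = 'HASH',
    |  'NUM_PARTITIONS' = '8',
    |  'SORT_COLUMNS' = 'city, order_id')""".stripMargin)
```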
Regards,
Liang