[Discussion] Taking the inputs for Segment Interface Refactoring


[Discussion] Taking the inputs for Segment Interface Refactoring

Ajantha Bhat
Hi Dev,
We have discussed segment interface refactoring multiple times, but we
are not moving ahead.
The final goal of this activity is to *design a clean segment interface
that can support time travel, concurrent operations, and transaction
management.*

So I am welcoming problems, ideas, and designs for this, as many people
have different ideas about it.
We can have a virtual design meeting for this if required.

Thanks,
Ajantha

Re: [Discussion] Taking the inputs for Segment Interface Refactoring

David CaiQiang
Before starting to refactor the segment interface, I list the
segment-related features as follows.

[table related]
1. get lock for table
   lock for tablestatus
   lock for updatedTablestatus
2. get lastModifiedTime of table

[segment related]
1. segment datasource
   datasource: file format, other datasource
   file format: carbon, parquet, orc, csv, ...
   catalog type: segment, external segment
2. data load ETL (load/insert/add_external_segment/insert_stage)
   write segment for batch loading
   add external segment using an external folder path for a
mixed-file-format table
   append streaming segment for Spark Structured Streaming
   insert_stage for the Flink writer
3. data query
   segment properties and schema
   segment level index cache and pruning
   cache/refresh block/blocklet index cache if needed by segment
   read segments to a dataframe/rdd
4. segment management
   new segment id for loading/insert/add_external_segment/insert_stage
   create global segment identifier
   show[history]/delete segment
5. stats
   collect dataSize and indexSize of the segment
   lastModifiedTime, start/end time, update start/end time
   fileFormat
   status
6. segment level lock for supporting concurrent operations
7. get tablestatus storage factory
   storage solution 1: use the file system by default
   storage solution 2: use Hive metastore or a DB
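For item 7, the tablestatus storage factory could look roughly like the sketch below. All class and method names here are hypothetical stand-ins, not existing CarbonData APIs, and the two stores are in-memory placeholders for the two storage solutions:

```java
import java.util.HashMap;
import java.util.Map;

public class TableStatusStoreFactory {

    /** Minimal contract a tablestatus store would satisfy. */
    public interface TableStatusStore {
        void write(String tableId, String status);
        String read(String tableId);
    }

    /** Solution 1: file-system-backed store (in-memory stand-in here). */
    public static class FileSystemStore implements TableStatusStore {
        private final Map<String, String> files = new HashMap<>();
        public void write(String tableId, String status) { files.put(tableId, status); }
        public String read(String tableId) { return files.get(tableId); }
    }

    /** Solution 2: Hive-metastore/DB-backed store (also a stand-in). */
    public static class MetastoreStore implements TableStatusStore {
        private final Map<String, String> rows = new HashMap<>();
        public void write(String tableId, String status) { rows.put(tableId, status); }
        public String read(String tableId) { return rows.get(tableId); }
    }

    /** File system is the default; the backend is chosen by configuration. */
    public static TableStatusStore create(String backend) {
        return "metastore".equals(backend) ? new MetastoreStore() : new FileSystemStore();
    }
}
```

The point of the factory is that callers never know which backend is in use, so swapping the storage solution needs no caller changes.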

[table status related]:
1. record new LoadMetadataDetails
 loading/insert/compaction start/end
 add external segment start/end
 insert stage

2. update LoadMetadataDetails
  compaction
  update/delete
  drop partition
  delete segment

3. read LoadMetadataDetails
  list all/valid/invalid segment

4. backup and history
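The read path in item 3 (list all/valid/invalid segments) could be sketched like this; the status names are illustrative, not the exact CarbonData constants:

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentListing {

    /** Illustrative load statuses, not CarbonData's real constants. */
    public enum Status { SUCCESS, IN_PROGRESS, MARKED_FOR_DELETE, COMPACTED }

    /** A stripped-down stand-in for one LoadMetadataDetails entry. */
    public static class LoadMetadataDetails {
        public final String segmentId;
        public final Status status;
        public LoadMetadataDetails(String segmentId, Status status) {
            this.segmentId = segmentId;
            this.status = status;
        }
    }

    /** "list valid segment": only successfully loaded segments that were not
     *  compacted away or marked for delete are readable. */
    public static List<String> validSegments(List<LoadMetadataDetails> all) {
        List<String> valid = new ArrayList<>();
        for (LoadMetadataDetails d : all) {
            if (d.status == Status.SUCCESS) {
                valid.add(d.segmentId);
            }
        }
        return valid;
    }
}
```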

[segment file related]
1. write new segment file
   generate segment file name
     better to use a new timestamp to generate the segment file name for
each write, to avoid overwriting a segment file with the same name
   write segment file
   merge temp segment file
2. read segment file
   readIndexFiles
   readIndexMergeFiles
   getPartitionSpec
3. update segment file
   update
   merge index
   drop partition
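The timestamp-based naming suggested in item 1 could be as simple as the sketch below (the name pattern itself is an assumption for illustration):

```java
public class SegmentFileNaming {

    /** Build a segment file name from the segment id plus a fresh timestamp,
     *  so rewriting the same segment never overwrites the previous file,
     *  e.g. newSegmentFileName("2", 1700000000000L) -> "2_1700000000000.segment". */
    public static String newSegmentFileName(String segmentId, long timestampMs) {
        return segmentId + "_" + timestampMs + ".segment";
    }
}
```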

[clean files related]
1. clean stale files for a successful segment operation
   data deletion should be delayed for a period of time (maybe the query
timeout interval) to avoid deleting files immediately (except for drop
table/partition and force clean files)
   includes data files, index files, segment files, tablestatus files
   impacted operation: mergeIndex
2. clean stale files for a failed segment operation immediately
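The delayed-deletion rule in item 1 boils down to a single check, sketched here with hypothetical names:

```java
public class StaleFileCleaner {

    /** A stale file from a successful operation may only be deleted after a
     *  retention window (e.g. the query timeout interval) has passed, unless
     *  the clean is forced (drop table/partition, force clean files). */
    public static boolean canDelete(long fileModifiedTimeMs, long nowMs,
                                    long retentionMs, boolean forceClean) {
        if (forceClean) {
            return true;  // forced cleans delete immediately
        }
        // otherwise delay, so in-flight queries can still read the old files
        return nowMs - fileModifiedTimeMs > retentionMs;
    }
}
```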





-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] Taking the inputs for Segment Interface Refactoring

Ajantha Bhat
Hi Everyone. 
Please find the design of the refactored segment interfaces in the attached document. You can also check the same V3 version attached to the JIRA [https://issues.apache.org/jira/browse/CARBONDATA-2827].

It is based on some recent discussions and on the previous discussions from 2018 [http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Refactor-Segment-Management-Interface-td58926.html].

Note:
1) As the pre-aggregate feature is no longer present, and MV and SI support incremental loading, the previous problem of committing all child table statuses at once may no longer apply; the interfaces for that have been removed.
2) All of this will be developed in a new module called carbondata-acid, and the other modules that require it will depend on it.
3) Once this is implemented, we can discuss the design of time travel on top of it [transaction manager implementation and writing multiple table status files with versioning].
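To make note 3 concrete, here is a rough sketch of what versioned table status files could enable: reading a table as of a past timestamp by picking the right status file version. All names are hypothetical, and the actual design is still open for discussion:

```java
import java.util.List;

public class TableStatusVersions {

    /** Given the (ascending) timestamps of the table status versions that
     *  exist, return the latest version at or before the target time, or -1
     *  when the target predates every version. */
    public static long versionAt(List<Long> versionTimestamps, long targetTime) {
        long chosen = -1L;
        for (long v : versionTimestamps) {
            if (v <= targetTime) {
                chosen = v;   // still at or before the target: keep the latest
            } else {
                break;        // sorted ascending, so nothing later can match
            }
        }
        return chosen;
    }
}
```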

Please go through it and give your inputs.

Thanks,
Ajantha  

On Mon, Oct 19, 2020 at 9:43 AM David CaiQiang <[hidden email]> wrote:

Re: [Discussion] Taking the inputs for Segment Interface Refactoring

Ajantha Bhat
Hi all,

As per the online meeting, I have also thought through the design of the
transaction manager.
The transaction manager can be responsible for:
a. Cross-table transactions --> expose start transaction, commit
transaction, and rollback transaction to the user/application. Commit the
table status files of all tables only once, and only if the current
transaction succeeded in all of the tables.
b. Table-level versioning/MVCC for time travel: internally get the
transaction id (version id) for each table-level operation (DDL/DML),
write multiple table status files (one per version) for time travel, and
also keep one transaction file.
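Responsibility (a) could be sketched as a tiny in-memory model, just to make the commit-all-at-once idea concrete. Every name here is an assumption for illustration, not a proposed API:

```java
import java.util.HashMap;
import java.util.Map;

public class SimpleTransactionManager {

    private final Map<String, String> committed = new HashMap<>(); // table -> status
    private Map<String, String> staged;                            // current transaction

    /** start transaction: open a staging area for this transaction. */
    public void startTransaction() {
        staged = new HashMap<>();
    }

    /** Stage a table status change; invisible to readers until commit. */
    public void stageTableStatus(String table, String status) {
        staged.put(table, status);
    }

    /** commit transaction: publish every staged table status at once, so
     *  the tables either all move forward or none of them do. */
    public void commitTransaction() {
        committed.putAll(staged);
        staged = null;
    }

    /** rollback transaction: discard everything staged. */
    public void rollbackTransaction() {
        staged = null;
    }

    public String statusOf(String table) {
        return committed.get(table);
    }
}
```

The real implementation would of course have to publish the staged table status files atomically on storage rather than in memory, which is exactly the part that needs the design discussion.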

However, combining the transaction manager with the segment interface
refactoring work would complicate the design and be too much to handle in
one PR. So I want to proceed step by step.
*To handle segment interface refactoring first, please go through the
document attached in the previous mail (also present in the JIRA) and give
your opinion (+1) to go ahead.*

Thanks,
Ajantha


On Fri, Nov 13, 2020 at 2:43 PM Ajantha Bhat <[hidden email]> wrote:


Re: [Discussion] Taking the inputs for Segment Interface Refactoring

David CaiQiang
+1



-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/