
Re: [Discussion] Taking the inputs for Segment Interface Refactoring

Posted by Ajantha Bhat on Nov 13, 2020; 9:13am
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Discussion-Taking-the-inputs-for-Segment-Interface-Refactoring-tp101950p103238.html

Hi Everyone. 
Please find the design of the refactored segment interfaces in the attached document. You can also check the same V3 version attached to the JIRA [https://issues.apache.org/jira/browse/CARBONDATA-2827].

It is based on some recent discussions as well as the previous discussions from 2018
[http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Refactor-Segment-Management-Interface-td58926.html]

Note:
1) As the pre-aggregate feature is no longer present, and MV and SI support incremental loading, the previous problem of committing all child table statuses at once may no longer apply, so the interfaces for that have been removed.
2) All of this will be developed in a new module called carbondata-acid, and the other modules that need it will depend on it.
3) Once this is implemented, we can discuss the design of time travel on top of it (transaction manager implementation and writing multiple table status files with versioning), as sketched below.
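To make point 3 concrete, here is a rough, hypothetical sketch (not from the attached design; all names and the file-name pattern are my assumptions) of how versioned table status files could be named and resolved for time travel:

// Hypothetical sketch: one immutable table status file per commit,
// resolved by version number for time-travel reads.
public final class TableStatusVersioning {

  // e.g. version 42 -> "tablestatus_42"
  static String statusFileName(long version) {
    return "tablestatus_" + version;
  }

  // Pick the newest version that is not greater than the requested one.
  static long resolve(java.util.List<Long> availableVersions, long requested) {
    long best = -1;
    for (long v : availableVersions) {
      if (v <= requested && v > best) {
        best = v;
      }
    }
    return best; // -1 means no readable version at or before 'requested'
  }
}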

Please go through it and give your inputs.

Thanks,
Ajantha  

On Mon, Oct 19, 2020 at 9:43 AM David CaiQiang <[hidden email]> wrote:
Before starting to refactor the segment interface, I have listed the
segment-related features as follows.

[table related]
1. get lock for table
   lock for tablestatus
   lock for updatedTablestatus
2. get lastModifiedTime of table
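As an illustration only (the interface and method names here are my assumptions, not the proposed API), the table-related items above could map to something like:

import java.util.concurrent.locks.Lock;

// Illustrative sketch of the table-related operations listed above.
public interface TableOperations {

  // Lock guarding the tablestatus file.
  Lock tableStatusLock();

  // Lock guarding the updated tablestatus file (update/delete flows).
  Lock updatedTableStatusLock();

  // Last modified time of the table, in epoch milliseconds.
  long lastModifiedTime();
}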

[segment related]
1. segment datasource
   datasource: file format, other datasource
   file format: carbon, parquet, orc, csv, ...
   catalog type: segment, external segment
2. data load ETL (load/insert/add_external_segment/insert_stage)
   write segment for batch loading
   add external segment by using an external folder path for a mixed-file-format table
   append streaming segment for Spark Structured Streaming
   insert_stage for the Flink writer
3. data query
   segment properties and schema
   segment level index cache and pruning
   cache/refresh block/blocklet index cache if needed by segment
   read segments to a dataframe/rdd
4. segment management
   new segment id for loading/insert/add_external_segment/insert_stage
   create global segment identifier
   show[history]/delete segment
5. stats
   collect dataSize and indexSize of the segment
   lastModifiedTime, start/end time, update start/end time
   fileFormat
   status
6. segment level lock for supporting concurrent operations
7. get tablestatus storage factory
   storage solution 1): use file system by default
   storage solution 2): use hive metastore or db
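A minimal sketch covering several of the segment-related features above (segment id allocation, external segments, listing, segment-level locks, and the tablestatus storage factory); all names here are assumptions for discussion, not final interfaces:

import java.util.List;
import java.util.concurrent.locks.Lock;

// Illustrative sketch, not the final interface.
public interface SegmentManager {

  // Possible tablestatus storage backends (feature 7 above).
  enum StatusStorage { FILE_SYSTEM, HIVE_METASTORE, DB }

  // Allocate a new segment id for load/insert/add_external_segment/insert_stage.
  String createNewSegmentId();

  // Register an external segment backed by an external folder path.
  void addExternalSegment(String segmentId, String externalPath, String fileFormat);

  // List segment ids, either valid-only or including invalid/history segments.
  List<String> listSegments(boolean validOnly);

  // Segment-level lock for concurrent operations (feature 6 above).
  Lock segmentLock(String segmentId);
}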

[table status related]:
1. record new LoadMetadataDetails
 loading/insert/compaction start/end
 add external segment start/end
 insert stage

2. update LoadMetadataDetails
  compaction
  update/delete
  drop partition
  delete segment

3. read LoadMetadataDetails
  list all/valid/invalid segments

4. backup and history
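For discussion, the four tablestatus operations above could be grouped behind one store interface. This is a sketch with assumed names; LoadMetadataDetails is the existing per-segment status entry mentioned above, stubbed here only to keep the sketch self-contained:

import java.util.List;

// Stub for the existing per-segment status entry class.
class LoadMetadataDetails { }

// Illustrative sketch of the tablestatus operations above.
interface TableStatusStore {

  enum SegmentFilter { ALL, VALID, INVALID }

  // 1. Record a new entry (load/insert/compaction start/end,
  //    add external segment start/end, insert stage).
  void record(LoadMetadataDetails details);

  // 2. Update an existing entry (compaction, update/delete,
  //    drop partition, delete segment).
  void update(LoadMetadataDetails details);

  // 3. Read entries: all, valid only, or invalid only.
  List<LoadMetadataDetails> read(SegmentFilter filter);

  // 4. Back up the current tablestatus before overwriting, to keep history.
  void backup();
}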

[segment file related]
1. write new segment file
   generate segment file name
      better to use a new timestamp to generate the segment file name for
      each write, to avoid overwriting a segment file with the same name
      (see the sketch after this list)
   write segment file
   merge temp segment file
2. read segment file
   readIndexFiles
   readIndexMergeFiles
   getPartitionSpec
3. update segment file
   update
   merge index
   drop partition
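A minimal sketch of the timestamp-based naming suggested in point 1 above (the exact name pattern is an assumption):

// Hypothetical sketch: embed a fresh timestamp in each segment file name
// so a rewrite never overwrites the previous segment file.
public final class SegmentFileNaming {

  // e.g. segmentId "2" at t=1605250380000 -> "2_1605250380000.segment"
  static String newSegmentFileName(String segmentId, long timestampMs) {
    return segmentId + "_" + timestampMs + ".segment";
  }

  static String newSegmentFileName(String segmentId) {
    return newSegmentFileName(segmentId, System.currentTimeMillis());
  }
}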

[clean files related]
1. clean stale files for a successful segment operation
   data deletion should be delayed for a period of time (maybe the query
   timeout interval) to avoid deleting files immediately (except for drop
   table/partition and force clean files); see the sketch after this list
   includes: data files, index files, segment files, tablestatus files
   impacted operation: mergeIndex
2. clean stale files for a failed segment operation immediately
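A small sketch of the delayed-deletion rule described in point 1 (the class name and retention parameter are assumptions):

// Illustrative sketch: stale files from successful operations become
// deletable only after a retention window (e.g. the query timeout);
// failed operations and forced cleanup are deletable immediately.
public final class CleanFilesPolicy {

  private final long retentionMs;

  public CleanFilesPolicy(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  // 'force' covers drop table/partition and force clean files.
  public boolean canDelete(long staleSinceMs, long nowMs,
      boolean fromFailedOperation, boolean force) {
    if (force || fromFailedOperation) {
      return true;
    }
    return nowMs - staleSinceMs >= retentionMs;
  }
}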





-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/