Hi,
Carbon uses a tablestatus file to record the status and details of each segment during every load. This tablestatus is what enables Carbon to support concurrent loads and reads without data inconsistency or corruption, so it is a very important part of CarbonData and we should have clean interfaces to maintain it. Currently, tablestatus updates are scattered across multiple places and there is no clean interface, so I am proposing to refactor the current SegmentStatusManager interface and bring all tablestatus operations into a single interface. The new interface would also allow keeping the table status in other storage, such as a DB. This is needed for S3-type object stores, as they are eventually consistent.

Please check the attached design in the jira: https://issues.apache.org/jira/browse/CARBONDATA-2827

Please share your ideas on it.

--
Thanks & Regards,
Ravi
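To make the proposal concrete, here is a minimal sketch of what "all tablestatus operations behind a single interface" could look like. The names (TableStatusStore, SegmentDetail, InMemoryTableStatusStore) are illustrative assumptions, not taken from the CARBONDATA-2827 design; the point is only that every read and write of segment status goes through one pluggable abstraction, so a DB-backed implementation can replace the file-based one on eventually consistent stores like S3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical unified interface for table-status operations.
interface TableStatusStore {
    List<SegmentDetail> readSegments();       // read all segment entries
    void writeSegment(SegmentDetail detail);  // record a new segment entry
}

// Minimal per-load segment metadata (illustrative fields only).
class SegmentDetail {
    final String segmentId;
    final String status; // e.g. "SUCCESS", "IN_PROGRESS", "MARKED_FOR_DELETE"
    SegmentDetail(String segmentId, String status) {
        this.segmentId = segmentId;
        this.status = status;
    }
}

// In-memory implementation standing in for a file-, DB-, or KV-backed store.
class InMemoryTableStatusStore implements TableStatusStore {
    private final List<SegmentDetail> segments = new ArrayList<>();
    public synchronized List<SegmentDetail> readSegments() {
        return Collections.unmodifiableList(new ArrayList<>(segments));
    }
    public synchronized void writeSegment(SegmentDetail detail) {
        segments.add(detail);
    }
}
```

With such an interface in place, callers never touch the tablestatus file directly; swapping the backend is a matter of providing another TableStatusStore implementation.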
+1 on the idea.
Since the segment manager is very important for CarbonData, and there are multiple targets we want to achieve, let's first make sure we have the same understanding of the refactoring goals. Let me describe what is in my mind:

1. Segment metadata (TableStatus) needs to be read for every query. Because we currently store it in a file system or object storage, it impacts query performance: every query has to read this information from remote storage. To improve query performance, we should allow this metadata to be kept in a DB/KV store, which is faster. Different developers may prefer different storage backends, so the interface should be extensible enough that a developer can adapt the segment metadata storage to his choice.

2. The segment manager should have an interface to retrieve information and to write new data. Such an interface can be implemented in two flavors: one accesses the metadata store locally, the other goes through an RPC call to a metadata service that wraps the store. There are pros and cons to both approaches, and we should consider them both when designing the interfaces.

3. In a batch load, a segment also represents one transaction of the data write, so the Segment Manager should provide a commit protocol to control the transaction. Currently we rely on the locking feature of HDFS and on the RDD in the carbon-spark module that overwrites the TableStatus file; this approach couples us to HDFS and Spark. To avoid that, I imagine the Segment Manager should provide an API like "open/load/commit/close" to the developer, so that CarbonData can be integrated with transactional writes into any compute engine, such as Presto or Hive.

I think the refactoring of the Segment Manager should satisfy at least these three goals. Please correct me if you are thinking differently.
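The "open/load/commit/close" protocol in point 3 could be sketched roughly as below. SegmentTransaction and its states are hypothetical names, not an actual CarbonData API; the sketch only illustrates the engine-agnostic idea that staged data becomes visible to readers solely when commit() flips the segment status, replacing the HDFS-lock plus Spark-RDD approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical transaction handle returned by a Segment Manager "open" call.
class SegmentTransaction {
    enum State { OPEN, COMMITTED, CLOSED, ABORTED }

    private State state = State.OPEN;
    private final List<String> stagedFiles = new ArrayList<>();

    // load(): stage data files written by the compute engine (Spark, Presto, Hive...).
    void load(String dataFilePath) {
        if (state != State.OPEN) throw new IllegalStateException("transaction not open");
        stagedFiles.add(dataFilePath);
    }

    // commit(): publish the staged files as a successful segment. A real
    // implementation would perform a single atomic status update in the
    // metadata store; here we just flip the state flag.
    void commit() {
        if (state != State.OPEN) throw new IllegalStateException("transaction not open");
        state = State.COMMITTED;
    }

    // close(): release resources; uncommitted work is abandoned (aborted).
    void close() {
        if (state == State.OPEN) state = State.ABORTED;
        else if (state == State.COMMITTED) state = State.CLOSED;
    }

    // Readers see the segment only after a successful commit.
    boolean isVisibleToReaders() {
        return state == State.COMMITTED || state == State.CLOSED;
    }
}
```

Because the protocol lives entirely in the Segment Manager, any engine that drives open/load/commit/close gets transactional writes without depending on HDFS locking or Spark internals.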
Regards,
Jacky

> On Aug 4, 2018, at 12:14 PM, Ravindra Pesala <[hidden email]> wrote:
Hi,
I have fixed the review comments and updated the design document. Please check the V2 version of the document in the jira: https://issues.apache.org/jira/browse/CARBONDATA-2827

Regards,
Ravi