Hi,
Carbon uses a tablestatus file to record the status and details of each segment during every load. This tablestatus is what enables Carbon to support concurrent loads and reads without data inconsistency or corruption, so it is a very important part of CarbonData and we should have clean interfaces to maintain it. Currently, tablestatus updates are scattered across multiple places and there is no clean interface, so I am proposing to refactor the current SegmentStatusManager interface and bring all tablestatus operations into a single interface. The new interface would also allow keeping the table status in other storage, such as a DB. This is needed for S3-type object stores, as they are eventually consistent.

Please check the attached design in the jira: https://issues.apache.org/jira/browse/CARBONDATA-2827

Please share your ideas on it.

--
Thanks & Regards,
Ravi
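To make the proposal concrete, here is a minimal sketch of what "all tablestatus operations behind a single interface" could look like. The names (TableStatusStore, SegmentDetail, InMemoryTableStatusStore) are illustrative assumptions, not taken from the CARBONDATA-2827 design; the point is only that every read and write of segment status goes through one pluggable abstraction, so a DB-backed implementation can replace the file-based one on eventually consistent stores like S3.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical unified interface for table-status operations.
interface TableStatusStore {
    List<SegmentDetail> readSegments();       // read all segment entries
    void writeSegment(SegmentDetail detail);  // record a new segment entry
}

// Minimal per-load segment metadata (illustrative fields only).
class SegmentDetail {
    final String segmentId;
    final String status; // e.g. "SUCCESS", "IN_PROGRESS", "MARKED_FOR_DELETE"
    SegmentDetail(String segmentId, String status) {
        this.segmentId = segmentId;
        this.status = status;
    }
}

// In-memory implementation standing in for a file-, DB-, or KV-backed store.
class InMemoryTableStatusStore implements TableStatusStore {
    private final List<SegmentDetail> segments = new ArrayList<>();
    public synchronized List<SegmentDetail> readSegments() {
        return Collections.unmodifiableList(new ArrayList<>(segments));
    }
    public synchronized void writeSegment(SegmentDetail detail) {
        segments.add(detail);
    }
}
```

With such an interface in place, callers never touch the tablestatus file directly; swapping the backend is a matter of providing another TableStatusStore implementation.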
+1 on the idea.
Since the segment manager is very important for CarbonData, and there are multiple targets we want to achieve, let's first make sure we have the same understanding of the refactoring goals. Let me describe what is in my mind:

1. Segment metadata (TableStatus) needs to be read for every query. Because we currently store it in a file system or object storage, it impacts query performance: every query has to read this information from remote storage. To improve query performance, we should allow this metadata to be kept in a DB/KV store, which is faster. Different developers may prefer different storage backends, so the interface should be extensible enough that a developer can adapt the segment metadata storage to his choice.

2. The segment manager should have an interface to retrieve information and to write new data. Such an interface can be implemented in two flavors: one accesses the metadata store locally, the other goes through an RPC call to a metadata service that wraps the store. There are pros and cons to both approaches, and we should consider them both when designing the interfaces.

3. In a batch load, a segment also represents one transaction of the data write, so the Segment Manager should provide a commit protocol to control the transaction. Currently we rely on the locking feature of HDFS and on the RDD in the carbon-spark module that overwrites the TableStatus file; this approach couples us to HDFS and Spark. To avoid that, I imagine the Segment Manager should provide an API like "open/load/commit/close" to the developer, so that CarbonData can be integrated with transactional writes into any compute engine, such as Presto or Hive.

I think the refactoring of the Segment Manager should satisfy at least these three goals. Please correct me if you are thinking differently.
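The "open/load/commit/close" protocol in point 3 could be sketched roughly as below. SegmentTransaction and its states are hypothetical names, not an actual CarbonData API; the sketch only illustrates the engine-agnostic idea that staged data becomes visible to readers solely when commit() flips the segment status, replacing the HDFS-lock plus Spark-RDD approach.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical transaction handle returned by a Segment Manager "open" call.
class SegmentTransaction {
    enum State { OPEN, COMMITTED, CLOSED, ABORTED }

    private State state = State.OPEN;
    private final List<String> stagedFiles = new ArrayList<>();

    // load(): stage data files written by the compute engine (Spark, Presto, Hive...).
    void load(String dataFilePath) {
        if (state != State.OPEN) throw new IllegalStateException("transaction not open");
        stagedFiles.add(dataFilePath);
    }

    // commit(): publish the staged files as a successful segment. A real
    // implementation would perform a single atomic status update in the
    // metadata store; here we just flip the state flag.
    void commit() {
        if (state != State.OPEN) throw new IllegalStateException("transaction not open");
        state = State.COMMITTED;
    }

    // close(): release resources; uncommitted work is abandoned (aborted).
    void close() {
        if (state == State.OPEN) state = State.ABORTED;
        else if (state == State.COMMITTED) state = State.CLOSED;
    }

    // Readers see the segment only after a successful commit.
    boolean isVisibleToReaders() {
        return state == State.COMMITTED || state == State.CLOSED;
    }
}
```

Because the protocol lives entirely in the Segment Manager, any engine that drives open/load/commit/close gets transactional writes without depending on HDFS locking or Spark internals.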
Regards,
Jacky

> On Aug 4, 2018, at 12:14 PM, Ravindra Pesala <[hidden email]> wrote:
Hi,
I have fixed the review comments and updated the design document. Please check the V2 version of the document in the jira: https://issues.apache.org/jira/browse/CARBONDATA-2827

Regards,
Ravi