
Re: [DISCUSSION] CarbonData storage service

Posted by Jacky Li on May 16, 2017; 1:13pm
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-CarbonData-storage-service-tp12634p12741.html

Hi Liang,

The storage service will run as a long-lived service; it can store the blocklet index in memory and serve file read requests.

And I agree with Ravindra's opinion that it would require considerable effort to go this way. Also, having realized that it may impact the scalability of the system, I think we need to give this storage service a second thought.

Regards,
Jacky


> On May 16, 2017, at 6:08 PM, Liang Chen <[hidden email]> wrote:
>
> Hi jacky
>
>
> One question: can you explain what information the proposed CarbonData
> Storage Service would store? And how should users pre-configure memory
> resources for the service? As large a memory allocation as possible?
> --------------------------------------------------------------------------------------------------------
> while CarbonData requires its own memory cache.
>
> Regards
> Liang
>
>
> 2017-05-14 0:19 GMT-04:00 Jacky Li <[hidden email]>:
>
>> Hi community,
>>
>> Partition feature is proposed by Cao Lu in thread (
>> http://apache-carbondata-dev-mailing-list-archive.1130556.
>> n5.nabble.com/Discussion-Implement-Partition-Table-
>> Feature-td10938.html#a11321 <http://apache-carbondata-dev-
>> mailing-list-archive.1130556.n5.nabble.com/Discussion-
>> Implement-Partition-Table-Feature-td10938.html#a11321>), implementation
>> effort is ongoing.
>>
>> After partition is implemented, point queries using sort columns are
>> expected to be faster than the current B-Tree index approach. To further
>> boost their performance and achieve higher concurrency, I would like to
>> discuss providing a service for CarbonData.
>>
>> Following is the proposal:
>>
>> CarbonData Storage Service
>> At the moment, the CarbonData project mainly defines a columnar format with
>> index support. These CarbonData files are read and written inside a
>> processing framework (such as a Spark executor), which is efficient for
>> OLAP/data-warehouse workloads; however, there is overhead for simple
>> queries like point queries. For example, in Spark, DAG breakdown, task
>> scheduling, and task serialization/deserialization are inevitable.
>> Furthermore, executor memory is meant to be controlled by Spark core, while
>> CarbonData requires its own memory cache.
>>
>> So, to improve on this, I suggest adding a Storage Service to the
>> CarbonData project. The main goal of this service is to serve point queries
>> and manage CarbonData storage.
>>
>> 1. Deployment
>> This service can be embedded in a processing framework (Spark executor) as
>> it is today, or deployed as a new self-managed process on HDFS data nodes.
>> For the latter approach, we can implement a YARN application to manage
>> these processes.
>>
>> 2. Communication
>> A service client will communicate with the service. One simple approach is
>> to reuse the current Netty RPC framework we have for dictionary generation
>> in single-pass loading. We need to add configuration for the RPC ports of
>> this service.
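As a rough illustration of the request half of such an RPC, the point-query message could be a small serializable object. This is only a sketch; all names here are hypothetical, not an existing CarbonData API, and a real implementation would likely use the existing Netty codec rather than Java serialization.

```java
import java.io.*;

// Hypothetical wire format for a point-query RPC between the service
// client and the Storage Service. Field and class names are illustrative.
public class PointQueryRequest implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String tableName;
    public final String partitionValue;  // value of PARTITION_COLUMN
    public final String sortKey;         // value of SORT_COLUMN

    public PointQueryRequest(String tableName, String partitionValue, String sortKey) {
        this.tableName = tableName;
        this.partitionValue = partitionValue;
        this.sortKey = sortKey;
    }

    // Serialize this request into a byte[] RPC payload.
    public byte[] toBytes() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reconstruct a request from a received payload.
    public static PointQueryRequest fromBytes(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (PointQueryRequest) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```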
>>
>> 3. Functionality
>> I can think of a few functionalities this service can provide; you can
>> suggest more.
>>        1) Serving point queries
>>        The query filter consists of PARTITION_COLUMN and SORT_COLUMN. The
>> client sends an RPC request to the service; the service opens the requested
>> file, locates the offset by SORT_COLUMN, and starts scanning. The reading
>> of CarbonData remains unchanged from the current CarbonData RecordReader.
>> Once the result data is collected, it is returned through the RPC response
>> to the client.
>>        By optimizing client- and service-side handling and the RPC
>> payload, this should be more efficient than a Spark task.
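The "locate the offset by SORT_COLUMN" step could, for instance, binary-search the sorted start keys of the blocklets to find where scanning should begin. A minimal sketch, assuming numeric keys for simplicity (real CarbonData sort keys are byte-comparable; the class name is hypothetical):

```java
// Illustrative only: given the ascending start keys of each blocklet,
// find the last blocklet whose start key is <= the queried SORT_COLUMN
// value, i.e. the blocklet where scanning should begin.
public class BlockletLocator {
    // startKeys must be sorted ascending; one entry per blocklet.
    public static int locateBlocklet(long[] startKeys, long queryKey) {
        int lo = 0, hi = startKeys.length - 1, ans = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (startKeys[mid] <= queryKey) {
                ans = mid;       // candidate blocklet; look further right
                lo = mid + 1;
            } else {
                hi = mid - 1;    // start key too large; look left
            }
        }
        return ans;              // -1 means the key precedes all blocklets
    }
}
```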
>>
>>        2) Cache management
>>        Currently, CarbonData caches the file-level index in the Spark
>> executor; this is not desirable, especially when dynamic allocation is
>> enabled in Spark. By adding this Storage Service, CarbonData can manage
>> this cache better inside its own memory space. Besides the index cache, we
>> can also consider adding a cache for hot blocks/blocklets, further reducing
>> IO and latency.
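One simple way the service could bound such a cache inside its own memory space is an LRU map. A minimal sketch bounded by entry count (a real implementation would bound by bytes and cache blocklet index objects; the class name is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a bounded index cache: a LinkedHashMap in access order
// evicts the least-recently-used entry once the cap is exceeded.
public class IndexCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public IndexCache(int maxEntries) {
        super(16, 0.75f, true);      // access-order gives LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict the least-recently-used entry
    }
}
```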
>>
>>        3) Compaction management
>>        Now, the SORT_COLUMN keyword is planned for CarbonData 1.2, and
>> users can use it to force NO SORT on a table to make loading faster. There
>> is also a BATCH_SORT option. By adding this service, we can implement a
>> policy in the service that triggers compaction to do larger-scope sorting
>> than the initial load.
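Such a policy could be as simple as counting the unsorted segments loaded since the last compaction and triggering once a threshold is crossed. A hypothetical sketch (all names are illustrative, not an existing CarbonData API):

```java
// Sketch of a threshold-based compaction policy for the service:
// schedule a larger-scope sort once enough NO SORT / BATCH_SORT
// segments have accumulated since the last compaction.
public class CompactionPolicy {
    private final int threshold;
    private int unsortedSegments = 0;

    public CompactionPolicy(int threshold) {
        this.threshold = threshold;
    }

    // Called by the service after each segment load; returns true when
    // a compaction (larger-scope sorting) should be scheduled.
    public boolean onSegmentLoaded(boolean sorted) {
        if (!sorted) {
            unsortedSegments++;
        }
        if (unsortedSegments >= threshold) {
            unsortedSegments = 0;    // reset after scheduling compaction
            return true;
        }
        return false;
    }
}
```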
>>
>> We may identify and add more functionality to this service in the future.
>>
>> What do you think of this idea?
>>
>> Regards,
>> Jacky
>>
>>
>>
>
>
>
>
> --
> Regards
> Liang