[DISCUSSION] CarbonData storage service


Jacky Li
Hi community,

The partition feature was proposed by Cao Lu in this thread: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Implement-Partition-Table-Feature-td10938.html#a11321, and the implementation effort is ongoing.

Once partitioning is implemented, point queries on sort columns are expected to be faster than the current B-Tree index approach. To further boost performance and achieve higher concurrency, I want to discuss providing a dedicated service for CarbonData.

Following is the proposal:

CarbonData Storage Service
At the moment, the CarbonData project mainly defines a columnar format with index support. CarbonData files are read and written inside a processing framework (such as a Spark executor). This works well for OLAP/data-warehouse workloads, but it adds overhead to simple queries such as point queries: in Spark, the DAG breakdown, task scheduling, and task serialization/deserialization are unavoidable. Furthermore, executor memory is meant to be controlled by Spark core, while CarbonData needs its own memory cache.

So, to improve on this, I suggest adding a Storage Service to the CarbonData project. The main goal of this service is to serve point queries and manage CarbonData storage.

1. Deployment
This service can be embedded in the processing framework (Spark executor) as it is today, or deployed as a new self-managed process on each HDFS data node. For the latter approach, we can implement a YARN application to manage these processes.
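
For the latter case, a minimal sketch of the YARN submission side is below, using the standard YARN client API. The class name org.apache.carbondata.service.CarbonStorageServiceAM is purely hypothetical; this only shows the shape of the YARN application, not an actual design.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class StorageServiceSubmitter {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("CarbonDataStorageService");

    // The AM (a hypothetical class we would implement) would launch one
    // storage service process per HDFS data node and restart it on failure.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java org.apache.carbondata.service.CarbonStorageServiceAM"));
    context.setAMContainerSpec(amContainer);

    // Resources for the AM container itself, not for the service processes.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);
    capability.setVirtualCores(1);
    context.setResource(capability);

    yarnClient.submitApplication(context);
  }
}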

2. Communication
A service client will communicate with the service. One simple approach is to reuse the Netty RPC framework we already have for dictionary generation in single-pass loading. We need to add configuration for the RPC ports of this service.
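
To make this concrete, a rough sketch of the service interface is below. All type and method names (PointQueryRequest, pointQuery, ...) are invented for illustration; the real interface would follow whatever conventions the existing Netty RPC framework already uses.

import java.io.Serializable;
import java.util.List;

// Hypothetical request type: enough information for the service to find
// the partition, seek by sort column, and project the requested columns.
class PointQueryRequest implements Serializable {
  String tableName;
  Object[] partitionColumnValues;  // selects the partition (and file) to read
  Object[] sortColumnValues;       // used to locate the start offset in the file
  List<String> projectColumns;     // columns to include in the result
}

// Hypothetical response type: the matched rows, already projected.
class PointQueryResponse implements Serializable {
  List<Object[]> rows;
}

// The endpoint a client would call over the Netty RPC channel.
public interface CarbonStorageService {
  PointQueryResponse pointQuery(PointQueryRequest request);
}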

3. Functionality
I can think of a few functionalities this service can provide; please suggest more.
        1) Serving point queries
        The query filter consists of PARTITION_COLUMN and SORT_COLUMN. The client sends an RPC request to the service; the service opens the requested file, locates the offset by SORT_COLUMN, and starts scanning. Reading the CarbonData file itself is unchanged from the current CarbonData RecordReader. Once the result data is collected, it is returned to the client in the RPC response (the interface sketch above shows one possible request shape).
        By optimizing the client- and service-side handling and the RPC payload, this should be more efficient than a Spark task.

        2) Cache management
        Currently, CarbonData caches the file-level index in the Spark executor, which is not desirable, especially when dynamic allocation is enabled in Spark. With this Storage Service, CarbonData can manage this cache inside its own memory space. Besides the index cache, we can also consider adding a cache for hot blocks/blocklets, further reducing IO and latency (see the cache sketch after this list).

        3) Compaction management
        The SORT_COLUMN keyword is planned for CarbonData 1.2; a user can use it to force NO SORT on a table to make loading faster, and there is a BATCH_SORT option as well. With this service, we can implement a policy that triggers compaction to perform larger-scope sorting than the initial load did (see the policy sketch after this list).
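
To illustrate 2), the blocklet index cache inside the service could start as a simple bounded LRU map; a minimal sketch is below. BlockletIndex is a placeholder type and the entry-count bound is a simplification (a real implementation would bound by memory size), but the eviction idea is the same:

import java.util.LinkedHashMap;
import java.util.Map;

// Placeholder for whatever per-file index structure the service caches.
class BlockletIndex { }

// A minimal LRU cache keyed by file path, bounded by entry count.
class BlockletIndexCache extends LinkedHashMap<String, BlockletIndex> {
  private final int maxEntries;

  BlockletIndexCache(int maxEntries) {
    super(16, 0.75f, true);     // accessOrder = true gives LRU iteration order
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, BlockletIndex> eldest) {
    return size() > maxEntries; // evict the least-recently-used entry
  }
}

And to illustrate 3), the compaction trigger could start as a simple threshold policy that the service evaluates periodically. Again, everything below is a hypothetical sketch, not a proposed design:

import java.util.List;

// Trigger compaction once enough unsorted segments (loaded with NO SORT
// or BATCH_SORT) have accumulated, so a larger-scope sort pays off.
class CompactionPolicy {
  private final int unsortedSegmentThreshold;

  CompactionPolicy(int unsortedSegmentThreshold) {
    this.unsortedSegmentThreshold = unsortedSegmentThreshold;
  }

  // segmentIsSorted holds one flag per segment in the table.
  boolean shouldCompact(List<Boolean> segmentIsSorted) {
    int unsorted = 0;
    for (boolean sorted : segmentIsSorted) {
      if (!sorted) {
        unsorted++;
      }
    }
    return unsorted >= unsortedSegmentThreshold;
  }
}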

We may identify and add more functionality to this service in the future.

What do you think of this idea?

Regards,
Jacky

Re: [DISCUSSION] CarbonData storage service

ravipesala
Hi Jacky,

The cost of implementing, maintaining, and productizing our own cluster is very high. Better to first measure how much latency Spark scheduling adds to point queries, and optimize within the current design. There are some insights on improving the concurrency of Spark queries; please check http://velvia.github.io/Spark-Concurrent-Fast-Queries/.

And regarding metadata as a separate service, I think we had a separate discussion on this. We thought of moving the complete B-Tree from the executors to the driver side, and eventually it could be moved to a separate dedicated service that handles only metadata.

Regards,
Ravindra.


Re: [DISCUSSION] CarbonData storage service

Liang Chen
In reply to this post by Jacky Li
Hi Jacky,


One question: can you explain what information the proposed CarbonData Storage Service would store? And how should users pre-configure memory resources for the service? As much memory as possible?

> while CarbonData requires its own memory cache.

Regards
Liang

Re: [DISCUSSION] CarbonData storage service

Jacky Li
Hi Liang,

The storage service will run as a long-lived process; it can keep the blocklet index in memory and serve file read requests.

And I agree with Ravindra's opinion that it requires a lot of effort to go this way. Also, after realizing it may impact the scalability of the system, I think we need to give this storage service a second thought.

Regards,
Jacky

