[DISCUSS] Distributed CarbonStore


[DISCUSS] Distributed CarbonStore

Ajith shetty
Hi all

Currently CarbonStore is tightly coupled with the FileSystem interface and runs inside the application's JVM process, for example in Spark. We can instead make CarbonStore run as a separate service that can be accessed over the network via RPC. So, as a follow-up to CARBONDATA-2688 (CarbonStore Java API and REST API), we can make CarbonStore distributed.

This has several advantages:

·  Distributed CarbonStore can support parallel scanning, i.e. multiple tasks can scan data in parallel, potentially with a higher parallelism factor than the compute layer

·  Distributed CarbonStore can provide an index service to multiple applications (Spark/Flink/Presto), so the index is shared and resources are saved

·  Distributed CarbonStore's resource consumption is isolated from the application, and it can easily be scaled to support higher workloads

·  As a future improvement, Distributed CarbonStore can implement a query cache, since it has independent resources



Distributed CarbonStore will have 2 main deployment parts:

1. A cluster of remote carbon store services

2. An SDK which acts as a client for communication with the store (a rough sketch of the client side follows below)
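
To illustrate the shape of the second part only, here is a minimal sketch of what the client SDK might look like, assuming a hypothetical CarbonStoreClient interface (none of these names exist in CarbonData today; they only illustrate talking to a remote store over the network instead of through the local FileSystem interface):

// Hypothetical SDK client for a remote CarbonStore cluster; all names are illustrative only.
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

interface CarbonStoreClient extends Closeable {

  // List tables known to the remote store.
  List<String> listTables() throws IOException;

  // Scan a table remotely; filter and projection are evaluated inside the store
  // cluster, so scanning runs with the store's parallelism, not the application's.
  Iterator<Object[]> scan(String tableName,
                          String[] projectColumns,
                          String filterExpression) throws IOException;
}

An application embedding Spark/Flink/Presto would obtain such a client for the store cluster and iterate over the returned rows, while the actual scanning and pruning run inside the store service.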

Please provide your inputs/suggestions. If the idea sounds promising, I will go ahead and create the JIRA/sub-JIRAs for the same.

Regards
Ajith

Re: [DISCUSS] Distributed CarbonStore

Jacky Li
+1

I think it is a good new feature to have, but the effort to develop it is quite high, and I am worried about the release cycle getting longer. Can you define a roadmap for this new feature, so it can be delivered in phases across future versions?

Do you have anything in mind for the roadmap?

Regards,
Jacky
 


Re: [DISCUSS] Distributed CarbonStore

Jacky Li
Hi Ajith,

After reading https://issues.apache.org/jira/browse/CARBONDATA-2827 proposed by Ravindra, and the Distributed CarbonStore proposed in this thread, I have been trying to think about what it would look like if we decoupled the whole of CarbonData into different services or micro-services, and how many such services there would be.

After thinking about it, I have the following picture:

Metadata related services:
1. Schema service: This service provides all schema-related operations such as CreateTable, DropTable, ListTable, DescTable, etc. to the client SDK. It is backed by storage with ACID support, for example a database, so that the table schema stays consistent across concurrent calls from multiple clients. This service could also include the management of DataMap commands.

2. Segment service: This service provides segment-level metadata operations and the transaction protocol/interface to the client, which I have described in my reply to the “Refactor Segment Management Interface” email thread.

3. Index service: This service provides the pruning operation to the client: it accepts parameters like those of “getSplit”, leverages the DataMaps of the table, and returns the pruned blocks to the client. It can read the index/DataMap files from the underlying file system or object store, or cache them in memory just like we do in the Spark driver today. This service also satisfies the need for a shared index across multiple applications, which many users in the community have asked for before. A rough sketch of these three interfaces follows below.
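
To make these three services more concrete, here is a minimal sketch of what their contracts could look like. All type and method names (SchemaService, SegmentService, IndexService, etc.) are assumptions for illustration, not existing CarbonData APIs, and each contract could be exposed either over RPC or as a plain library interface:

// Hypothetical contracts for the metadata-side services; names are illustrative only.
import java.io.IOException;
import java.util.List;

// 1. Schema service: schema operations backed by ACID storage (e.g. a database),
//    so concurrent clients always see a consistent table schema.
interface SchemaService {
  void createTable(String dbName, String tableName, String schemaJson) throws IOException;
  void dropTable(String dbName, String tableName) throws IOException;
  List<String> listTables(String dbName) throws IOException;
  String describeTable(String dbName, String tableName) throws IOException;
  // DataMap management could also live here.
  void createDataMap(String dbName, String tableName, String dataMapDefinition) throws IOException;
}

// 2. Segment service: segment-level metadata plus a simple transaction protocol.
interface SegmentService {
  // Open a new segment for writing; returns a segment/transaction id.
  String openSegment(String tableName) throws IOException;
  // Commit makes the segment visible to readers; abort discards it.
  void commitSegment(String tableName, String segmentId) throws IOException;
  void abortSegment(String tableName, String segmentId) throws IOException;
  // Segments that are currently visible for reading.
  List<String> listValidSegments(String tableName) throws IOException;
}

// 3. Index service: prune blocks using the table's DataMaps, which may be
//    cached in memory on the service side (like the Spark driver cache today).
interface IndexService {
  // Like getSplit: given a table and a filter, return only the block paths
  // that may contain matching rows, so callers scan far less data.
  List<String> prune(String tableName, String filterExpression) throws IOException;
}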

Data related services:
1. IO service: By IO, I mean it performs the reading and writing of CarbonData files. Its functionality is similar to what we do in the Spark executor today for scanning, and for loading with sort capability. Although it is called the IO service, it can still perform operation pushdown, including filter, projection, limit, and even aggregation and topN; which operations to push down depends on the integration with the upper-layer compute framework. With this service in place, it is also easy to add a query cache to improve query performance.

2. Stream ingest service: This service provides an operation to the client that “inserts” data into the CarbonStore, as in a KV system such as HBase. Developers can use the client in Flink/Spark Streaming/Kafka applications to ingest data. Furthermore, besides the “insert” operation, I think “update/delete by key” is also possible with CarbonData, since these are still relational operations, provided CarbonData supports the concept of a PrimaryKey; there is a high chance we can reuse the SORT_COLUMNS concept for that. A rough sketch of these two interfaces follows below.
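
Similarly, a rough sketch of the two data-side contracts, again with hypothetical names (ScanRequest, IOService, StreamIngestService); the upsert/deleteByKey methods assume a PrimaryKey concept (possibly built on SORT_COLUMNS) that does not exist yet:

// Hypothetical contracts for the data-side services; names are illustrative only.
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Describes a scan together with the operators pushed down to the store.
class ScanRequest {
  String tableName;
  List<String> projection;  // columns to read
  String filter;            // filter expression evaluated at the store
  long limit = -1;          // -1 means no limit
  String aggregation;       // optional aggregation / topN pushdown
}

// 1. IO service: reads and writes CarbonData files, with operation pushdown;
//    a query cache could also sit behind the scan call.
interface IOService {
  // Only the pushed-down result crosses the network back to the caller.
  Iterator<Object[]> scan(ScanRequest request) throws IOException;
  // Write (optionally sorted) rows into a segment, as the Spark executor does today.
  void write(String tableName, String segmentId, Iterator<Object[]> rows) throws IOException;
}

// 2. Stream ingest service: KV-style ingestion from Flink/Spark Streaming/Kafka clients.
interface StreamIngestService {
  void insert(String tableName, Map<String, Object> row) throws IOException;
  // Possible only if the table defines a primary key (e.g. reusing SORT_COLUMNS).
  void upsert(String tableName, Map<String, Object> row) throws IOException;
  void deleteByKey(String tableName, Map<String, Object> key) throws IOException;
  // Flush buffered rows into a streaming segment.
  void flush(String tableName) throws IOException;
}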

First we should discuss and finalize whether these 5 services are complete. Then we can put the different implementations of them into the roadmap. The term “service” I use here may be misleading: one “service” is really one bundle of logical functionality and interfaces. It does not have to be an RPC service; it can also be a JAR library that the client invokes and executes in the client's JVM process.

One benefit of this decoupled architecture is that it may help satisfy more usage scenarios, for example:
1. Large-scale data analytics scenarios that require a shared index
2. Low-latency queries in compute/storage-decoupled scenarios, such as remote HDFS or cloud storage
3. Low latency for frequent queries, since the data cache (in the IO service) can be leveraged
4. Isolated resources for real-time ingest and reading, so that their impact on each other is minimal

Again, as much benefit as we can get by going in this direction, there is a lot of effort required to get there. So it is quite important that we have a tangible roadmap for it.

Regards,
Jacky

