[Discussion] Carbon Store abstraction

[Discussion] Carbon Store abstraction

Jacky Li
Hi community,

I am proposing to create a carbondata-store module to abstract the carbon store concept. The reasons are:

1. Initially, carbon was designed as a file format; as it evolved to provide more features, it implemented more and more functionality in the spark integration module. However, as the community tries to integrate more and more compute frameworks with carbon, this functionality is duplicated across the integration layers. Ideally, it can be unified and provided in one place.

2. The current interface carbondata exposes to users is SQL, but the developer interface for those who want to do compute engine integration is not very clear.

3. There are many SQL commands that carbon supports, but they are implemented through spark RDDs only, so they are not sharable across compute frameworks.

For these reasons, and for the long-term future of carbondata, I think it is better to abstract the interface for compute engine integration into a new module called carbondata-store. It can wrap all store-level functionality above the file format in a module independent of any compute engine, so that every integration module can depend on it and duplicated code is removed.

This is a continuous long-term effort; I will break this work into subtasks and start by creating JIRA issues, if you agree.

Regards,
Jacky Li


Re: [Discussion] Carbon Store abstraction

Liang Chen
Hi

Thank you for starting this discussion. I agree; to expose a clear interface
to users, there is some optimization work to do.

Can you list more details about your proposal? For example: which classes
you propose to move to carbon store, and which APIs you propose to create and
expose to users.
I suggest we discuss and confirm your proposal on dev first, then start
to create sub-tasks in Jira.

Regards
Liang



Re: [Discussion] Carbon Store abstraction

sraghunandan
I think we need to integrate with presto and hive first, and then refactor. This
gives a clear idea of what we want to achieve. Each processing engine is different
in its own way, and integrating first would give us a clear idea of what is
required in CarbonData.

Re: [Discussion] Carbon Store abstraction

Jacky Li
In reply to this post by Liang Chen
Hi All,

To provide clear APIs and avoid cyclic dependency between modules, the overall design will look like the following diagram:

Hive-integration  ---\
                      \
Spark-integration -----+--> carbondata-store ----------> carbondata-metadata
                                    |                        ^    ^    ^
                                    +--> carbondata-table ---+    |    |
                                    +--> carbondata-core ---------+    |
                                    +--> carbondata-processing --------+

There are three new modules:
1. Carbondata-store: The main purpose of carbondata-store is to provide the public interface to all integration modules. It is a very thin module.
2. Carbondata-table: It implements the interfaces defined in carbondata-store; it provides table-level concept abstractions like schema, segment, etc.
3. Carbondata-metadata: It holds all metadata classes that need to be shared across modules, such as the TableInfo object.

In order to provide a clean API, ONLY carbondata-store and carbondata-metadata should provide public APIs; other modules like carbondata-table and carbondata-processing should not expose any public classes or methods, they just implement the interfaces defined by carbondata-store.
This also means that if we find public classes or methods in carbondata-core, carbondata-table or carbondata-processing, they should be refactored and moved to either carbondata-metadata or carbondata-store.
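
As an illustration of that rule, here is a minimal Java sketch, assuming the proposed module layout (all class and method names here are hypothetical, for discussion only):

```java
// --- carbondata-store module (the only public surface), hypothetical ---
package org.apache.carbondata.store;

public interface TableManager {
  // Check whether a table exists at the given path (illustrative signature)
  boolean tableExists(String tablePath);
}

// --- carbondata-table module (implementation, nothing public), hypothetical ---
package org.apache.carbondata.table;

// Package-private: integration modules can only reach this class
// through the interface defined in carbondata-store.
class TableManagerImpl implements org.apache.carbondata.store.TableManager {
  @Override
  public boolean tableExists(String tablePath) {
    return false; // the real check against the table path goes here
  }
}
```

Instantiation could then go through a public factory in carbondata-store (for example via Java's `ServiceLoader`), so nothing in carbondata-table ever needs to be public.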

The public API provided by carbondata-store should include:

1. Table management:
    - Initialize and persist table metadata when the integration module creates a table. Currently, the metadata includes `TableInfo`. The table path should be specified by the integration module
    - Delete metadata and data in the table path when the integration module drops a table
    - Retrieve `TableInfo` from a table path
    - Check whether a table exists
    - Alter metadata in `TableInfo`
2. Segment management (segments are operated on in a transactional way):
    - Open a new segment when the integration module loads new data
    - Commit the segment when the data operation succeeds
    - Close the segment when the data operation fails
    - Delete a segment when the integration module drops it
    - Retrieve segment information by a given segmentId
3. Compaction management:
    - Compaction policy for deciding whether compaction should be carried out
4. Data operations (carbondata-store provides map functions in a map-reduce manner):
    - Data loading map function
    - Delete segment map function
    - Other operations that involve map-side work (basically, the `internalCompute` function in all RDDs in the current spark integration module)


This is the current idea; please advise.

Regards,
Jacky Li



Re: [Discussion] Carbon Store abstraction

Jacky Li
In reply to this post by sraghunandan
The markup format in the earlier mail was incorrect. Please refer to this one.

carbondata-store is responsible for providing the following interfaces (a Java sketch follows the list):
1. Table management:
    - Initialize and persist table metadata when the integration module creates a table. Currently, the metadata includes `TableInfo`. The table path should be specified by the integration module
    - Delete metadata and data in the table path when the integration module drops a table
    - Retrieve `TableInfo` from a table path
    - Check whether a table exists
    - Alter metadata in `TableInfo`
2. Segment management (segments are operated on in a transactional way):
    - Open a new segment when the integration module loads new data
    - Commit the segment when the data operation succeeds
    - Close the segment when the data operation fails
    - Delete a segment when the integration module drops it
    - Retrieve segment information by a given segmentId
3. Compaction management:
    - Compaction policy for deciding whether compaction should be carried out
4. Data operations (carbondata-store provides map functions in a map-reduce manner):
    - Data loading map function
    - Delete segment map function
    - Other operations that involve map-side work (basically, the `internalCompute` function in all RDDs in the current spark integration module)
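
Before the interface document exists, a minimal Java sketch may help review the shape of this API. All names and signatures below are assumptions for discussion; only `TableInfo` and the operations listed above come from the proposal (`Segment` and `Row` are placeholder types):

```java
package org.apache.carbondata.store;

import java.io.IOException;
import java.util.Iterator;

// Placeholder types for this sketch; in the real proposal TableInfo is
// the existing metadata class and Segment would live in carbondata-metadata.
interface Row {}
interface Segment { String getSegmentId(); }
interface TableInfo {}

// Hypothetical public facade of carbondata-store, covering the four
// areas above. Signatures are illustrative only.
interface CarbonStore {

  // 1. Table management
  void createTable(String tablePath, TableInfo tableInfo) throws IOException;
  void dropTable(String tablePath) throws IOException;
  TableInfo getTableInfo(String tablePath) throws IOException;
  boolean tableExists(String tablePath) throws IOException;
  void alterTable(String tablePath, TableInfo updated) throws IOException;

  // 2. Segment management (transactional)
  Segment openSegment(String tablePath) throws IOException;
  void commitSegment(Segment segment) throws IOException;
  void closeSegment(Segment segment) throws IOException;      // on failure
  void deleteSegment(String tablePath, String segmentId) throws IOException;
  Segment getSegment(String tablePath, String segmentId) throws IOException;

  // 3. Compaction management
  boolean shouldCompact(String tablePath) throws IOException;  // policy decision

  // 4. Data operations: map functions that each engine wraps in its
  //    own tasks (Spark RDD compute, Hive/Presto split processing)
  void loadData(Segment segment, Iterator<Row> input) throws IOException;
  void deleteSegmentData(Segment segment) throws IOException;
}
```

An integration module would then drive a load transactionally, roughly like:

```java
Segment segment = store.openSegment(tablePath);
try {
  store.loadData(segment, rows);   // runs inside each engine task
  store.commitSegment(segment);    // make the segment visible
} catch (IOException e) {
  store.closeSegment(segment);     // failed load never becomes visible
}
```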



Re: [Discussion] Carbon Store abstraction

ravipesala
Hi Jacky,

Thank you for steering this activity. Yes, there is a need to refactor the code
to get store management out of the spark integration module. It becomes
difficult to add another integration module if there is no clear API for
store management.
Please find my comments:
1. Is it really necessary to extract three modules? I think we can create a
carbon-store-management module and keep table, segment, compaction and
data management in it.
2. Also, we had better rename the current carbon-core module to carbon-scan or
carbon-io, since we are extracting all store management out of it.
3. Table status creation and updating should also belong to
segment management.
4. I think the data loading map function is CarbonOutputFormat, and it should
belong to the carbon-processing and carbon-hadoop modules.

I think it is better to have an interface document for the public APIs we are
going to expose, so that it is easy to check whether the presto and hive
integration needs are satisfied.
Since this is a big piece of work, we had better split the JIRAs in such a way
that they are independent of each other and can be done across versions.
Multiple people can also work on this in parallel.

Regards,
Ravindra.


--
Thanks & Regards,
Ravi

Re: [Discussion] Carbon Store abstraction

Jacky Li


> On 20 Oct 2017, at 8:34 PM, Ravindra Pesala <[hidden email]> wrote:
>
> Hi Jacky,
>
> Thank you for steering this activity. Yes, there is a need to refactor the code
> to get store management out of the spark integration module. It becomes
> difficult to add another integration module if there is no clear API for
> store management.
> Please find my comments:
> 1. Is it really necessary to extract three modules? I think we can create a
> carbon-store-management module and keep table, segment, compaction and
> data management in it.
> 2. Also, we had better rename the current carbon-core module to carbon-scan or
> carbon-io, since we are extracting all store management out of it.
> 3. Table status creation and updating should also belong to
> segment management.

I agree with the above 3 points. My only worry is whether there will be cyclic dependencies. I think we can start by extracting only one new module; if there is a cyclic dependency, then we need to extract one more module or put the common classes into carbon-common.
Let us rename carbon-core to carbon-io.

> 4. I think the data loading map function is CarbonOutputFormat, and it should
> belong to the carbon-processing and carbon-hadoop modules.

Those map functions need to be present in carbon-processing, and we will put CarbonTableOutputFormat in carbon-hadoop.
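
To make that concrete, here is a rough sketch of how such an output format could look in carbon-hadoop. Only the standard Hadoop `OutputFormat` contract is real; every Carbon-specific detail here is an assumption for discussion:

```java
package org.apache.carbondata.hadoop;

import java.io.IOException;

import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch of an engine-neutral output format; rows modeled as Object[].
public class CarbonTableOutputFormat extends OutputFormat<Void, Object[]> {

  @Override
  public RecordWriter<Void, Object[]> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    return new RecordWriter<Void, Object[]>() {
      @Override
      public void write(Void key, Object[] row) {
        // delegate to the data loading map function in carbon-processing
      }

      @Override
      public void close(TaskAttemptContext ctx) {
        // flush carbondata files written by this task
      }
    };
  }

  @Override
  public void checkOutputSpecs(JobContext context) throws IOException {
    // validate the table path and schema taken from the job configuration
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    return new OutputCommitter() {
      @Override public void setupJob(JobContext job) {
        // open a new segment (transactional begin)
      }
      @Override public void setupTask(TaskAttemptContext task) { }
      @Override public boolean needsTaskCommit(TaskAttemptContext task) {
        return false;
      }
      @Override public void commitTask(TaskAttemptContext task) { }
      @Override public void abortTask(TaskAttemptContext task) {
        // close the segment on failure (transactional rollback)
      }
    };
  }
}
```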

>
> I think it is better to have an interface document for the public APIs we are
> going to expose, so that it is easy to check whether the presto and hive
> integration needs are satisfied.
> Since this is a big piece of work, we had better split the JIRAs in such a way
> that they are independent of each other and can be done across versions.
> Multiple people can also work on this in parallel.

Sure, I will write an interface document for this work.

>
> Regards,
> Ravindra.