Apache CarbonData Dev Mailing List archive

[Discussion] Roadmap for Apache CarbonData 2


ravipesala

[Discussion] Roadmap for Apache CarbonData 2

Hi Community,

Three years have passed since the launch of the Apache CarbonData
project, and CarbonData has become a popular data management solution for
various scenarios. As new workloads like AI and new runtime environments like
the cloud are emerging quickly, I think we have reached a point where we need
to discuss the future of CarbonData.

To bring CarbonData to a new level and satisfy those new requirements, Jacky
and I drafted a roadmap for CarbonData 2 on the cwiki website.
- English Version:
https://cwiki.apache.org/confluence/display/CARBONDATA/Apache+CarbonData+2+Roadmap+Proposal
- Chinese Version:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120737492

Please feel free to discuss the roadmap in this thread. We welcome
all feedback that makes CarbonData better.

Thanks and Regards,
Ravindra.

manhua

Re: [Discussion] Roadmap for Apache CarbonData 2

Currently, a datamap in Carbon applies to all segments.
The roadmap mentions commands like add/drop segment, and possibly something
about incremental loading for MV. For these scenarios, it would be better if a
datamap could be used at the segment level, instead of disabling the datamap
whenever its data is not ready for some segment. This would also make datamaps
fail-safe and improve Carbon's stability.
Maybe we can consider this as well.

-----
Regards
Manhua

---Original---
From: "Ravindra Pesala"<[hidden email]>
Date: Tue, Jul 16, 2019 22:31 PM
To: "dev"<[hidden email]>;
Subject: [Discussion] Roadmap for Apache CarbonData 2


Nicholas

Apache CarbonData 2 RoadMap Feedback


In reply to this post by ravipesala

Hi Community,

I have already read the CarbonData 2 roadmap. Among the CarbonData 2 features, I think the integration with Flink deserves more implementation effort. As we all know, Flink 1.9, which merges in Alibaba's Blink, will be released at the end of this month. Building real-time data warehouses through CarbonData's Flink integration will attract many engineers to CarbonData and open up more possibilities for real-time artificial intelligence platforms. This is just my opinion, and I have great interest in building the integration with Flink.

Thanks,
Nicholas

ravipesala

Re: [Discussion] Roadmap for Apache CarbonData 2

In reply to this post by manhua

Hi Kevin,

Yes, we can improve it. The implementation is closely related to supporting
pre-aggregate datamaps on streaming tables, which we implemented
some time ago. The same will be reimplemented for the MV datamap
soon as well.
That implementation uses the pre-aggregate datamap for non-streaming
segments and the main table for streaming segments. We update the query plan to
do a union on both tables and query only the streaming segments of the main
table.
So in this case we can use the same approach: run a union query over the
MV table and the main table (only the segments not yet loaded into the datamap)
and execute it. We can definitely consider this after we support streaming
tables for the MV datamap.
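
As a rough illustration of the union approach described above (not CarbonData's actual planner code), the sketch below splits a table's segments into those already loaded into the MV and those that are not, then combines a pre-aggregated result with a scan of the remaining main-table segments. The names `ScanPlan`, `plan_union_scan`, and `total_sales`, and the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanPlan:
    """Which segments are read from the MV table and which from the main table."""
    mv_segments: tuple
    main_segments: tuple


def plan_union_scan(all_segments, mv_loaded_segments):
    """Split segments between the MV table and the main table.

    Segments whose MV data is loaded are answered from the MV table;
    the rest fall back to the main table. The final result is the
    union of both scans.
    """
    loaded = set(mv_loaded_segments)
    mv_part = tuple(s for s in all_segments if s in loaded)
    main_part = tuple(s for s in all_segments if s not in loaded)
    return ScanPlan(mv_segments=mv_part, main_segments=main_part)


def total_sales(plan, mv_totals, main_rows):
    """Toy aggregate: pre-aggregated sums from the MV plus raw rows from the main table."""
    from_mv = sum(mv_totals[s] for s in plan.mv_segments)
    from_main = sum(sum(main_rows[s]) for s in plan.main_segments)
    return from_mv + from_main


# Segments 0 and 1 already have MV data; segment 2 was just loaded and does not.
main_rows = {0: [10, 20], 1: [5], 2: [7, 8]}
mv_totals = {0: 30, 1: 5}
plan = plan_union_scan(all_segments=[0, 1, 2], mv_loaded_segments=[0, 1])
print(plan)
print(total_sales(plan, mv_totals, main_rows))
```

With this split, a segment that is missing MV data only changes which table it is read from, so the MV does not have to be disabled for the whole table.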

Regards,
Ravindra.

On Wed, 17 Jul 2019 at 07:55, kevinjmh <[hidden email]> wrote:


--
Thanks & Regards,
Ravi

ravipesala

Re: Apache CarbonData 2 RoadMap Feedback

In reply to this post by Nicholas

Hi,

Yes, Flink and CarbonData integration will definitely attract more users.
We welcome any contributions in that direction.

Regards,
Ravindra.

On Thu, 18 Jul 2019 at 07:55, 蒋晓峰 <[hidden email]> wrote:


--
Thanks & Regards,
Ravi

xubo245

Re: Apache CarbonData 2 RoadMap Feedback

There are some problems when users handle AI data. For example, uploading or downloading large numbers of images to or from S3 is very slow: it takes about 10 hours to upload 10 million images (40 GB) to S3 using 1 thread. AI developers also want to manage both structured and unstructured data for their training algorithms, prediction, and other tasks.
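
To put the figures above in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in this message and assumes 1 GB = 10^9 bytes; the variable names are just for illustration.

```python
# Figures quoted in the message; GB is taken as 10**9 bytes (an assumption).
num_images = 10_000_000
total_bytes = 40 * 10**9
elapsed_seconds = 10 * 3600

avg_image_bytes = total_bytes / num_images                # bytes per image
images_per_second = num_images / elapsed_seconds          # upload rate
seconds_per_image = elapsed_seconds / num_images          # time spent per image
throughput_mb_s = total_bytes / elapsed_seconds / 10**6   # effective MB/s

print(f"average image size: {avg_image_bytes:.0f} bytes")
print(f"images per second:  {images_per_second:.1f}")
print(f"time per image:     {seconds_per_image * 1000:.1f} ms")
print(f"throughput:         {throughput_mb_s:.2f} MB/s")
```

The images average about 4 KB each and the effective throughput is only about 1.1 MB/s, which suggests (though the message does not say so) that per-object overhead rather than bandwidth is the limiting factor for many small files.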

We have already done some work on CarbonData for the AI domain, and the performance is great: CarbonData is many times faster than raw data when uploading to or downloading from S3. But there are still some gaps that CarbonData needs to support or optimize. CarbonData should be ready to support data management for AI applications.


------------------ Original Message ------------------
From: "Ravindra Pesala"<[hidden email]>;
Sent: Thursday, July 18, 2019, 11:26 PM
To: "dev"<[hidden email]>;

Subject: Re: Apache CarbonData 2 RoadMap Feedback


kumarvishal09

Re: [Discussion] Roadmap for Apache CarbonData 2

In reply to this post by ravipesala

Hi Ravi,

We can add the requirements below to 2.0:

1. Data loading performance improvement (needs analysis and improvement).
2. Unified reading of the carbon data file. Currently data is read in two parts,
dimension and measure, which increases the number of IOs.
3. Carbon store size optimization (a PR is already raised and needs to be
revisited); we can also explore further optimizations (like RLE hybrid bit packing).
4. Presto enhancements (like write support, Presto SQL adaptation, complex
type read support).
5. Spark Data Source V2 integration.
6. Spatial index support.
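
For readers unfamiliar with the hybrid encoding mentioned in item 3, here is a simplified sketch of the idea: long runs of a repeated value are stored as (value, count), and everything else is bit-packed at the smallest width that fits. This is an in-memory illustration only. It is not CarbonData's storage format and does not follow the byte layout of any existing format; the function names and the `min_run` threshold are assumptions, and values are assumed to be non-negative integers.

```python
def pack_bits(values, bit_width):
    """Pack non-negative ints into bytes, bit_width bits each (little-endian)."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def unpack_bits(data, bit_width, count):
    """Inverse of pack_bits: read `count` values of bit_width bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


def encode(values, min_run=4):
    """Emit ("rle", value, count) for long runs, ("packed", width, count, bytes) otherwise."""
    chunks = []
    literals = []  # values waiting to be bit-packed

    def flush_literals():
        if literals:
            width = max(1, max(literals).bit_length())
            chunks.append(("packed", width, len(literals), pack_bits(literals, width)))
            literals.clear()

    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        run_length = j - i
        if run_length >= min_run:
            flush_literals()
            chunks.append(("rle", values[i], run_length))
        else:
            literals.extend(values[i:j])
        i = j
    flush_literals()
    return chunks


def decode(chunks):
    """Rebuild the original list of values from the encoded chunks."""
    out = []
    for chunk in chunks:
        if chunk[0] == "rle":
            _, value, count = chunk
            out.extend([value] * count)
        else:
            _, width, count, data = chunk
            out.extend(unpack_bits(data, width, count))
    return out


data = [7] * 10 + [1, 2, 3, 0, 2] + [5] * 6
encoded = encode(data)
print([chunk[0] for chunk in encoded])
print(decode(encoded) == data)
```

The run of ten 7s and the run of six 5s each collapse to a single chunk, while the five mixed values need only 2 bits apiece.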

-Regards
Kumar Vishal

On Thu, Jul 18, 2019 at 8:20 PM Ravindra Pesala <[hidden email]>
wrote:


kumar vishal

manishgupta88

Re: [Discussion] Roadmap for Apache CarbonData 2

Hi Team,

It is good to see how CarbonData has grown and become popular over time.
It was important to take a fresh look and come up with a roadmap for future needs.
The CarbonData 2.0 proposal looks good, as we are trying to align it with the
cloud, which will more or less be the dominant runtime environment in the near
future. A lot of code refactoring will be required as per the roadmap. I
would like to add a couple of points.

1. Complex type support: Although we do have complex type support, there is
scope for improvement. Use cases for nested columns are growing
extensively. We should work on improving the storage of nested columns and
should also support creating compound/multi-column indexes for nested
columns.
2. Feature code segregation and pluggability: The current code is tightly
coupled. The ideal case would be to have a base and make all the features
pluggable into it, but that will be hard to achieve. We can try segregation
at the package level for major features, but for any new feature developed
we should think in terms of pluggability.
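
To make point 2 more concrete, here is a minimal sketch of what "a base with pluggable features" could look like. Everything in it is hypothetical (the `Feature` interface, the `FeatureRegistry` base, and the two example features); it only illustrates the shape of the design, not existing CarbonData code.

```python
from abc import ABC, abstractmethod


class Feature(ABC):
    """Hypothetical contract that every pluggable feature implements."""

    name: str

    @abstractmethod
    def on_load(self, segment_id: int) -> str:
        """Called by the base after a segment is loaded."""


class FeatureRegistry:
    """The 'base': it knows only the Feature interface, not concrete features."""

    def __init__(self):
        self._features = {}

    def register(self, feature: Feature) -> None:
        if feature.name in self._features:
            raise ValueError(f"feature already registered: {feature.name}")
        self._features[feature.name] = feature

    def unregister(self, name: str) -> None:
        self._features.pop(name, None)

    def notify_load(self, segment_id: int) -> list:
        # Each registered feature reacts to the load event independently.
        return [f.on_load(segment_id) for f in self._features.values()]


class IndexFeature(Feature):
    name = "index"

    def on_load(self, segment_id: int) -> str:
        return f"index: built for segment {segment_id}"


class CompactionFeature(Feature):
    name = "compaction"

    def on_load(self, segment_id: int) -> str:
        return f"compaction: checked after segment {segment_id}"


registry = FeatureRegistry()
registry.register(IndexFeature())
registry.register(CompactionFeature())
print(registry.notify_load(3))

registry.unregister("compaction")
print(registry.notify_load(4))
```

The base never imports a concrete feature, so a new feature can be added or removed without touching the core load path.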

[Clarification] Carbon UI: I did not understand the purpose of the Carbon segment
management UI. For the cloud scenario we would have to expose REST endpoints,
which would make Carbon more like a microservice, and that does not fit
the CarbonData use case. A UI/tool makes more sense for internal testing, but
I am not sure how it would benefit end users. Maybe a tool showing the
data stored in each table would be more useful to the end user.

Regards
Manish Gupta

On Tue, Aug 13, 2019 at 4:51 PM Kumar Vishal <[hidden email]>
wrote:


xuchuanyin

Re: [Discussion] Roadmap for Apache CarbonData 2

In reply to this post by ravipesala

Hi, I am glad to see CarbonData entering the 2.x stage. I have the following
suggestions for your consideration:

1. Evolution of the CarbonData file format.
Previously I thought one of the key highlights of CarbonData was the
CarbonData file format. Is there any evolution planned for it?
As CarbonData moves into broader application scopes, will the current
file format still suit them well?

2. Performance commitment of CarbonData.
It seems that CarbonData cares more about expanding its scope of application
than about performance enhancement.
What is the performance commitment of CarbonData 2 for data loading and querying?
Many enterprises do have big data, but it is not BIG enough to need
cloud/data lake etc.
For these scenarios, is CarbonData's performance clearly better than other
fileFormat+executionEngine combinations?
Do we have any plan for the enhancement?

3. Smarter CarbonData.
As we suggested earlier, is a CarbonData advisor on the roadmap?
CarbonData has many features, but I notice that some of them are never used
by users.
While CarbonData will serve the AI domain, can it be smarter itself as well?
The CarbonData advisor would be a DBA for CarbonData that monitors the
workload, usage, and current performance and gives proper suggestions, or even
performs the proper operations itself.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

melin li

Re: [Discussion] Roadmap for Apache CarbonData 2

Parse SQL with ANTLR4.

On Sat, Aug 17, 2019 at 11:31 AM, xuchuanyin <[hidden email]> wrote:
