Apache CarbonData Dev Mailing List archive

[Feature Proposal] Spark 2 integration with CarbonData

Jacky Li

[Feature Proposal] Spark 2 integration with CarbonData

Hi all,

Currently CarbonData only works with Spark 1.5 and Spark 1.6. As the Apache Spark community is moving to 2.1, more and more users will deploy Spark 2.x in production environments. To make CarbonData even more popular, I think now is a good time to start considering Spark 2.x integration with CarbonData.

Moreover, we can take this as a chance to refactor CarbonData to make it both easier to use and faster.

Usability:
Instead of using CarbonContext, in the Spark 2 integration users should be able to:
1. Use the native SparkSession in a Spark application to create and query tables backed by CarbonData files with full feature support, including index and late decode optimization.

2. Use CarbonData's API and tools to accomplish Carbon-specific tasks, such as compaction, deleting segments, etc.
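
To make the "late decode" idea in item 1 concrete, here is a minimal sketch in plain Python. It is not CarbonData code: the function names and sample data are hypothetical, and it assumes late decode means grouping on dictionary-encoded integer codes and translating them back to the original strings only for the final result.

```python
from collections import defaultdict


def dictionary_encode(values):
    """Map each distinct string to a small integer code (illustrative only)."""
    dictionary = {}
    codes = []
    for value in values:
        # First time a value is seen, it gets the next free code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    # Reverse lookup table: position = code, entry = original string.
    reverse = [None] * len(dictionary)
    for value, code in dictionary.items():
        reverse[code] = value
    return codes, reverse


def sum_by_key_late_decode(codes, reverse, amounts):
    """Aggregate on the integer codes; decode only the final group keys."""
    totals = defaultdict(int)
    for code, amount in zip(codes, amounts):
        totals[code] += amount
    # Decoding happens once per group instead of once per input row.
    return {reverse[code]: total for code, total in totals.items()}


# Hypothetical sample data.
cities = ["shenzhen", "bangalore", "shenzhen", "santa clara", "bangalore"]
amounts = [10, 20, 30, 40, 50]

codes, reverse = dictionary_encode(cities)
result = sum_by_key_late_decode(codes, reverse, amounts)
print(result)
```

In this sketch the five input rows are grouped using integers only, and just three strings are looked up at the end, which is the saving that deferring the decode is meant to give.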

Performance:
1. Deep integration with the DataSource API, leveraging Spark 2's whole-stage codegen feature.

2. Provide an implementation of a vectorized record reader to improve scanning performance.
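
As a rough illustration of item 2, the sketch below contrasts a row-at-a-time reader with a vectorized (batch) reader in plain Python. It is only a conceptual sketch under my own assumptions: `read_rows`, `read_batches`, and the sample columns are hypothetical names, not CarbonData or Spark APIs.

```python
from typing import Dict, Iterator, List

Columns = Dict[str, List[int]]


def read_rows(columns: Columns) -> Iterator[Dict[str, int]]:
    """Row-at-a-time reader: builds and returns one record per call."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    for i in range(num_rows):
        yield {name: columns[name][i] for name in names}


def read_batches(columns: Columns, batch_size: int) -> Iterator[Columns]:
    """Vectorized reader: returns a slice of every column per call."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    for start in range(0, num_rows, batch_size):
        yield {name: columns[name][start:start + batch_size] for name in names}


# Hypothetical columnar data.
data = {"id": [1, 2, 3, 4, 5], "price": [10, 20, 30, 40, 50]}

# Same aggregate computed both ways.
row_total = sum(row["price"] for row in read_rows(data))
batches = list(read_batches(data, batch_size=2))
batch_total = sum(sum(batch["price"]) for batch in batches)
print(row_total, batch_total, len(batches))
```

Both readers produce the same total, but the batch reader is called three times instead of five and hands each column over as a contiguous list, which is the per-call overhead a vectorized reader aims to reduce.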

Since Spark 2 has changed a lot compared to Spark 1.6, it may take some time to complete all these features. With the help of contributors and committers, I hope we can have basic features working in the next CarbonData release.

What do you think about this idea? All kinds of contributions and suggestions are welcome.

Regards,
Jacky Li

Liang Chen

Re: [Feature Proposal] Spark 2 integration with CarbonData

Administrator

Hi

Very excited to see that CarbonData will integrate with Spark 2.x. Looking forward to further performance improvements and better usability.

Regards
Liang

Venkata Gollamudi

Re: [Feature Proposal] Spark 2 integration with CarbonData

Hi All,

+1
I agree with Jacky. It is important for the CarbonData community to work on
Spark 2.x, as Spark 2.x has major design and interface changes. It is also a
challenge to support both Spark 2.x and Spark 1.x. We can start creating
sub-tasks under the issue CARBONDATA-322.

Regards,
Ramana

On Sun, Nov 27, 2016 at 9:39 AM, Liang Chen <[hidden email]> wrote:

> Hi
>
> Very excited to see that CarbonData will integrate with Spark 2.x, look
> forward to getting performance improved further and usability enhanced.
>
> Regards
> Liang

Jacky Li

Re: [Feature Proposal] Spark 2 integration with CarbonData

Hi Ramana,

Sure, I can work out a list of subtasks and put it under CARBONDATA-322.

Regards,
Jacky

David CaiQiang

Re: [Feature Proposal] Spark 2 integration with CarbonData

+1
I think I can finish some tasks. Please assign some of them to me.

Best Regards
David Cai

Jihong Ma

RE: [Feature Proposal] Spark 2 integration with CarbonData

In reply to this post by Jacky Li

Integration with Spark 2.x is a great feature for CarbonData, as Spark 2.x is gradually gaining momentum. This is a big effort, so let's take into account all the complexity involved due to the dramatic API-level changes. Realizing it in phases is a good idea.

Regards.

Jihong

-----Original Message-----
From: Jacky Li [mailto:[hidden email]]
Sent: Saturday, November 26, 2016 10:08 AM
To: [hidden email]
Subject: [Feature Proposal] Spark 2 integration with CarbonData
