Apache CarbonData Dev Mailing List archive

Open Discussion:Apache CarbonData Roadmap

Liang Chen

Open Discussion:Apache CarbonData Roadmap

Administrator

Hi

I would like to start a discussion thread on the Apache CarbonData roadmap.
Any input and comments would be much appreciated!

Apache CarbonData 0.1.0-incubating

Support integration with Apache Spark 1.5.2, 1.6.1, and 1.6.2
Support integration with Apache Hadoop 2.2 and later versions
Columnar data store
Full index: significantly accelerates query performance and reduces I/O scans and CPU usage when the query contains filters. It can also skip-scan at a finer-grained unit (called a blocklet) during task-side scanning, instead of scanning the whole file.
Global Multi-Dimensional Key (MDK) based B+Tree index for all non-measure columns
Min-Max index for all columns
Inverted index for all dimensions
Operable encoded data: through efficient compression and global encoding schemes, queries can run directly on compressed/encoded data; the data is converted only just before the results are returned to the user ("late materialization").
Column group: allows multiple columns to form a column group that is stored in row format. This reduces the row reconstruction cost at query time.
Supports various use cases with a single data format: interactive OLAP-style queries, sequential access (big scan), and random access (narrow scan).
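
To illustrate the blocklet skip-scan idea from the list above, here is a minimal sketch of min-max pruning. It is not CarbonData's actual implementation: the Blocklet class, build_blocklets, and scan_equal are hypothetical names, and the fixed blocklet size of 4 is an assumption chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Blocklet:
    """Hypothetical blocklet: a run of values from one column plus its min/max."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocklets(column: List[int], size: int) -> List[Blocklet]:
    """Split a column into fixed-size blocklets and record min/max for each."""
    blocklets = []
    for start in range(0, len(column), size):
        chunk = column[start:start + size]
        blocklets.append(Blocklet(chunk, min(chunk), max(chunk)))
    return blocklets


def scan_equal(blocklets: List[Blocklet], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocklets actually scanned."""
    matches: List[int] = []
    scanned = 0
    for b in blocklets:
        if target < b.min_value or target > b.max_value:
            continue  # min/max proves no row in this blocklet can match
        scanned += 1
        matches.extend(v for v in b.values if v == target)
    return matches, scanned


column = [1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23]
blocklets = build_blocklets(column, size=4)
matches, scanned = scan_equal(blocklets, 11)
print(matches, scanned)
```

With sorted or clustered data, most blocklets fall outside the filter range, so only a fraction of the file needs to be read and decoded.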

Apache CarbonData 0.2.0-incubating

Support integration with Apache Spark 2.1
Support Map data type (CARBONDATA-45)
Support creating a CarbonData table by selecting from another data store’s table
Remove Kettle to support more flexible data loading
Support CarbonDataOutputFormat
Add create-table properties to simplify data loading, especially settings for high-cardinality columns

Regards
Liang

Jean-Baptiste Onofré

Re: Open Discussion:Apache CarbonData Roadmap

Hi Liang,

It sounds good.

Any plans to support Apache Beam (instead of Spark directly)?

Regards
JB

On 08/09/2016 06:02 AM, chenliang613 wrote:

> Hi
>
> I would like to start a discussion thread on the Apache CarbonData roadmap.
> Any input and comments would be much appreciated!
>
> Apache CarbonData 0.1.0-incubating
>
> Support integration with Apache Spark 1.5.2, 1.6.1, and 1.6.2
> Support integration with Apache Hadoop 2.2 and later versions
> Columnar data store
> Full index: significantly accelerates query performance and reduces I/O
> scans and CPU usage when the query contains filters. It can also skip-scan
> at a finer-grained unit (called a blocklet) during task-side scanning,
> instead of scanning the whole file.
> Global Multi-Dimensional Key (MDK) based B+Tree index for all non-measure
> columns
> Min-Max index for all columns
> Inverted index for all dimensions
> Operable encoded data: through efficient compression and global encoding
> schemes, queries can run directly on compressed/encoded data; the data is
> converted only just before the results are returned to the user ("late
> materialization").
> Column group: allows multiple columns to form a column group that is stored
> in row format. This reduces the row reconstruction cost at query time.
> Supports various use cases with a single data format: interactive
> OLAP-style queries, sequential access (big scan), and random access
> (narrow scan).
>
> Apache CarbonData 0.2.0-incubating
>
> Support integration with Apache Spark 2.1
> Support Map data type (CARBONDATA-45)
> Support creating a CarbonData table by selecting from another data store’s
> table
> Remove Kettle to support more flexible data loading
> Support CarbonDataOutputFormat
>
> Regards
> Liang

--
Jean-Baptiste Onofré
[hidden email]
http://blog.nanthrax.net
Talend - http://www.talend.com

Liang Chen

Re: Open Discussion:Apache CarbonData Roadmap

Administrator

Hi JB

Thanks for your comments.
Removing Kettle prepares for integration with Apache Beam/Apache Flink to support real-time data loading.

I would like to propose integration with Apache Beam etc. in Apache CarbonData 0.3.0.

Regards
Liang

Jean-Baptiste Onofré

Re: Open Discussion:Apache CarbonData Roadmap

It sounds like a plan ;)

On 08/09/2016 08:37 AM, chenliang613 wrote:

> Hi JB
>
> Thanks for your comments.
> Removing Kettle prepares for integration with Apache Beam/Apache Flink to
> support real-time data loading.
>
> I would like to propose integration with Apache Beam etc. in Apache
> CarbonData 0.3.0.
>
> Regards
> Liang

ZhuWilliam

Re: Open Discussion:Apache CarbonData Roadmap

In reply to this post by Liang Chen

I think the first step should be to make CarbonData easier to use. To achieve this, the following points could be considered for the new version:

1. Kettle should be removed
2. Hive support should be optional
3. Properties should be configurable via --conf
4. The dictionary for high-cardinality fields could be stored on disk (the system file cache will automatically speed up access)
5. Support Spark 2.0

Liang Chen

Re: Open Discussion:Apache CarbonData Roadmap

Administrator

Hi William

Thanks for your input.
Most of your points will be considered in 0.2.0: removing Kettle, adding create-table properties to simplify data loading (especially settings for high-cardinality columns), and supporting Spark 2.0.

Regards
Liang

Jacky Li

Re: Open Discussion:Apache CarbonData Roadmap

I think William’s point is valid; we should focus mainly on usability improvements in 0.2.0.

Besides what Liang has pointed out, I have a brief list in mind that could be planned across several releases, if the items make sense to community users. They mainly target broader integration and better performance.

1. Streaming ingest. This requires CarbonData to add support for a new format and to integrate with a streaming engine.
2. Code refactoring to put CarbonData in good shape for integrating with processing frameworks other than Spark. It should be able to integrate with both batch and streaming engines, including Hive, Flink, Beam, Spark Streaming, Kafka, etc.
3. More dictionary support. For example, really high-cardinality columns could use a file-level local dictionary for encoding.
4. Further performance improvements for join operations by leveraging CarbonData's late materialization.
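
For readers unfamiliar with the terms in points 3 and 4, the following is a minimal sketch of dictionary encoding combined with late materialization. It is an illustration under stated assumptions, not CarbonData code: build_dictionary and filter_encoded are hypothetical helpers, and the dictionary here is a plain in-memory mapping rather than a global or file-level dictionary file.

```python
from typing import Dict, List, Tuple


def build_dictionary(values: List[str]) -> Tuple[Dict[str, int], List[str]]:
    """Assign each distinct value an integer key (sorted for determinism)."""
    decode = sorted(set(values))
    encode = {v: i for i, v in enumerate(decode)}
    return encode, decode


def filter_encoded(encoded: List[int], key: int) -> List[int]:
    """Return row positions whose encoded value equals key; no strings are compared."""
    return [row for row, code in enumerate(encoded) if code == key]


cities = ["shenzhen", "paris", "shenzhen", "bangalore", "paris"]
amounts = [10, 20, 30, 40, 50]

encode, decode = build_dictionary(cities)
encoded_cities = [encode[c] for c in cities]

# Translate the filter literal once, then compare integers only.
key = encode.get("paris")
rows = filter_encoded(encoded_cities, key) if key is not None else []

# Late materialization: decode back to strings only for rows that survived.
result = [(decode[encoded_cities[r]], amounts[r]) for r in rows]
print(result)
```

A join can use the same trick: match on the integer keys first and decode only the rows that appear in the output.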

Regards,
Jacky

> On Aug 9, 2016, at 10:07 PM, chenliang613 <[hidden email]> wrote:
>
> Hi William
>
> Thanks for your input.
> Most of your points will be considered in 0.2.0: removing Kettle, adding
> create-table properties to simplify data loading (especially settings for
> high-cardinality columns), and supporting Spark 2.0.
>
> Regards
> Liang

Jihong Ma

Re: Open Discussion:Apache CarbonData Roadmap

In reply to this post by Liang Chen

I would like to add a little more context to Carbon's future plan:

1. Improve usability to make Carbon easy to use by introducing simplified table properties for configuring a Carbon table, for instance a simple configuration to define the MDK index, leaving the complexity of performance tuning to the internals.

2. Adding partitioning support is important for further performance enhancement; its benefit is widely proven.

3. Improve Carbon's extensibility: define clear API interfaces between Carbon modules to make them easy to extend in the future. This is required for integration with other processing frameworks as well as for Carbon's own extensions, for instance introducing new file types to suit different workloads.

4. Integration with streaming frameworks: as a first step, enable Kafka to write out Carbon data as a reliable sink.

Jihong

From: Jacky Li
To: [hidden email];
Subject: Re: Open Discussion:Apache CarbonData Roadmap

Time: 2016-08-10 07:42:35

I think William’s point is valid; we should focus mainly on usability improvements in 0.2.0.

Besides what Liang has pointed out, I have a brief list in mind that could be planned across several releases, if the items make sense to community users. They mainly target broader integration and better performance.

1. Streaming ingest. This requires CarbonData to add support for a new format and to integrate with a streaming engine.
2. Code refactoring to put CarbonData in good shape for integrating with processing frameworks other than Spark. It should be able to integrate with both batch and streaming engines, including Hive, Flink, Beam, Spark Streaming, Kafka, etc.
3. More dictionary support. For example, really high-cardinality columns could use a file-level local dictionary for encoding.
4. Further performance improvements for join operations by leveraging CarbonData's late materialization.

Regards,
Jacky