Apache CarbonData Dev Mailing List archive

[DISCUSSION] PyCarbon: provide python interface for users to use CarbonData by python code

xubo245

[DISCUSSION] PyCarbon: provide python interface for users to use CarbonData by python code

More and more people use big data to optimize their algorithms, train their
models, deploy their models as services, and run inference on images. Storing,
managing, and analyzing large volumes of structured and unstructured data is
a big challenge, especially for unstructured data such as images, video, and
audio.

Many users build their projects for these scenarios in Python. Apache
CarbonData is an indexed columnar data store for fast analytics on big data
platforms. It offers many useful features and high performance for storing,
managing, and analyzing big data. Apache CarbonData already supports the
String, Int, Double, Boolean, Char, Date, and Timestamp data types, and it
also supports Binary (CARBONDATA-3336), which avoids the small binary files
problem, can speed up S3 access by dozens or even hundreds of times, and
reduces the cost of accessing OBS by decreasing the number of S3 API calls.
However, it is not easy for these users to use CarbonData through
Java/Scala/C++, so it would be better to provide a Python interface that
lets them use CarbonData from Python code.

We have been working on this feature for several months at
https://github.com/xubo245/pycarbon

*Goals:
1. Apache CarbonData should provide a Python interface for writing and
reading structured and unstructured data in CarbonData, such as String, int,
and binary data (image/voice/video). It should not depend on Apache Spark.
2. Apache CarbonData should provide a Python interface that lets deep
learning frameworks such as TensorFlow, MXNet, and PyTorch read and write
data from/to CarbonData. It should not depend on Apache Spark.
3. Apache CarbonData should provide a Python interface for managing and
analyzing data based on Apache Spark. Apache CarbonData should support the
DDL, DML, and DataMap features in Python.*

*Details:*
*1. Apache CarbonData should provide a Python interface for writing and
reading structured and unstructured data in CarbonData, such as String, int,
and binary data (image/voice/video). It should not depend on Apache Spark.*
Apache CarbonData already provides Java/Scala/C++ interfaces for users, and
more and more people use Python to manage and analyze big data, so it would
be better to provide a Python interface for writing and reading structured
and unstructured data in CarbonData, such as String, int, and binary data
(image/voice/video). It should not depend on Apache Spark. We call it
PYSDK.

PYSDK is based on the CarbonData Java SDK and uses pyjnius to call Java code
from Python. Apache Spark uses py4j in PySpark to call Java code from
Python, but py4j performs poorly when reading large volumes of
CarbonData-format data in Python code. The py4j documentation also notes low
performance when reading big data:
https://www.py4j.org/advanced_topics.html#performance. JPype is another
popular tool for calling Java code from Python, but it stopped being updated
several years ago, so we cannot use it. In our tests, pyjnius showed high
performance when reading big data by calling Java code from Python, so it is
a good choice for us.
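
The pyjnius approach described above can be sketched roughly as follows.
This is only a minimal illustration, not the actual PYSDK code: the helper
names `open_carbon_reader` and `iter_rows` and the `sdk_jar` argument are
hypothetical, and the `CarbonReader.builder(path, tableName)` call is assumed
to match the CarbonData Java SDK version that is on the classpath.

```python
def open_carbon_reader(table_path, sdk_jar):
    """Build a CarbonData Java SDK reader through pyjnius.

    Assumption: sdk_jar points at a CarbonData SDK jar that bundles its
    dependencies; adjust the classpath for your own deployment.
    """
    # Imported lazily: the classpath must be set before jnius starts the JVM.
    import jnius_config
    jnius_config.set_classpath(sdk_jar)
    from jnius import autoclass

    CarbonReader = autoclass("org.apache.carbondata.sdk.file.CarbonReader")
    # "_temp" is just a placeholder table name for this sketch.
    return CarbonReader.builder(table_path, "_temp").build()


def iter_rows(reader):
    """Yield rows from any object exposing hasNext/readNextRow/close."""
    try:
        while reader.hasNext():
            yield reader.readNextRow()
    finally:
        # Always release the underlying Java reader.
        reader.close()
```

Keeping the row loop separate from the JVM setup means the iteration logic
can be exercised with a plain Python stand-in for the reader, without a JVM.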

We have been working on this feature for several months at
https://github.com/xubo245/pycarbon
Goals:

1). PYSDK should provide an interface for reading data
2). PYSDK should provide an interface for writing data
3). PYSDK should support the basic data types
4). PYSDK should support projection
5). PYSDK should support filters
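
To make the projection and filter goals concrete, here is a small pure-Python
sketch of the intended semantics. It does not use the real PYSDK API: the
`scan` function and the sample rows are hypothetical, and a real reader would
push both operations down to the CarbonData files instead of filtering in
memory.

```python
def scan(rows, projection=None, predicate=None):
    """Yield dict rows that match predicate, keeping only projected columns."""
    for row in rows:
        # The filter runs first, so it may reference non-projected columns.
        if predicate is not None and not predicate(row):
            continue
        if projection is None:
            yield dict(row)
        else:
            yield {name: row[name] for name in projection}


# Hypothetical sample data: two rows with a binary image column.
rows = [
    {"id": 1, "name": "cat.jpg", "image": b"\xff\xd8"},
    {"id": 2, "name": "dog.jpg", "image": b"\xff\xd9"},
]

# Keep only id and name for rows whose id is greater than 1.
result = list(scan(rows, projection=["id", "name"],
                   predicate=lambda r: r["id"] > 1))
```

Projection matters most for binary columns: skipping the image column avoids
moving large values when only the metadata is needed.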

*2. Apache CarbonData should provide a Python interface that lets deep
learning frameworks such as TensorFlow, MXNet, and PyTorch read and write
data from/to CarbonData. It should not depend on Apache Spark.*

Goals:
1). CarbonData provides a Python interface that lets TensorFlow read data
from CarbonData for model training
2). CarbonData provides a Python interface that lets MXNet read data from
CarbonData for model training
3). CarbonData provides a Python interface that lets PyTorch read data
from CarbonData for model training
4). CarbonData should support an epoch function
5). CarbonData should support caching to speed up reads.
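
The epoch and cache goals can be illustrated with a small framework-neutral
sketch. This is an assumption about how such a reader might behave, not the
PyCarbon design: the `EpochReader` class and its parameters are hypothetical,
and an in-memory list stands in for whatever cache is eventually used.

```python
import random


class EpochReader:
    """Replay a dataset for a fixed number of epochs, caching the first pass."""

    def __init__(self, load_rows, num_epochs=1, shuffle=False, seed=None):
        self._load_rows = load_rows      # callable returning an iterable of rows
        self._num_epochs = num_epochs
        self._shuffle = shuffle
        self._rng = random.Random(seed)
        self._cache = None

    def __iter__(self):
        for _ in range(self._num_epochs):
            if self._cache is None:
                # First epoch: read from storage once and keep rows in memory.
                self._cache = list(self._load_rows())
            rows = list(self._cache)
            if self._shuffle:
                # Reshuffle a copy each epoch; the cached order is untouched.
                self._rng.shuffle(rows)
            yield from rows
```

With `num_epochs=2`, the rows are produced twice but `load_rows` is called
only once, which is the point of the cache.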

*3. Apache CarbonData should provide a Python interface for managing and
analyzing data based on Apache Spark. Apache CarbonData should support the
DDL, DML, and DataMap features in Python.*

Goals:
1). PyCarbon supports reading data from local/HDFS/S3 in Python code through
a PySpark DataFrame
2). PyCarbon supports writing data from Python code to local/HDFS/S3 through
a PySpark DataFrame
3). PyCarbon supports DDL in Python in SQL form
4). PyCarbon supports DML in Python in SQL form
5). PyCarbon supports DataMap in Python in SQL form
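
As a rough sketch of what DDL, DML, and DataMap in SQL form could look like
from Python, the function below passes statements to a session's `sql()`
method. The function name, table name, and columns are hypothetical; `spark`
is assumed to be a SparkSession already configured for CarbonData, and the
statement syntax should be checked against the CarbonData version in use.

```python
def run_carbon_sql(spark, table="pycarbon_demo"):
    """Issue DDL, DML and DataMap statements through spark.sql().

    Assumption: spark is a CarbonData-enabled SparkSession (any object with
    a compatible sql() method works for this sketch).
    """
    statements = [
        # DDL: create a CarbonData table.
        f"CREATE TABLE IF NOT EXISTS {table} (id INT, name STRING) "
        "STORED AS carbondata",
        # DML: insert one row.
        f"INSERT INTO {table} SELECT 1, 'a'",
        # DataMap: bloom filter index on the name column.
        f"CREATE DATAMAP IF NOT EXISTS {table}_dm ON TABLE {table} "
        "USING 'bloomfilter' DMPROPERTIES('INDEX_COLUMNS'='name')",
    ]
    for stmt in statements:
        spark.sql(stmt)
    # With a real session, the query result is a PySpark DataFrame.
    return spark.sql(f"SELECT id, name FROM {table}")
```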

The JIRA is:

https://issues.apache.org/jira/browse/CARBONDATA-3254

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Jacky Li

Re: [DISCUSSION] PyCarbon: provide python interface for users to use CarbonData by python code

+1

Great proposal, thanks for contributing

Regards,
Jacky


Ajantha Bhat

Re: [DISCUSSION] PyCarbon: provide python interface for users to use CarbonData by python code

+1,

As we have already worked on it, we should integrate it as cleanly as
possible.

I think this can be done in 2 layers.

1. *PySDK:* A generic Python layer over the Java SDK. Users who don't need AI
support and only want a Python SDK layer can use just this.
a. It supports reading and writing carbondata files (like the Java SDK). *We
can have a document listing all the APIs we support.*
b. This layer also supports building the Arrow carbon reader that the Java
SDK provides. Here we read carbon files and fill the in-memory Arrow vector.
This is used by the PyCarbon layer.

2. *PyCarbon:* This layer is responsible for integrating carbondata with AI
engines such as TensorFlow, MXNet, and PyTorch to provide AI scenarios like
epoch and shuffle.
*Uber's petastorm (an open source, Apache-licensed project)* supports all of
the above scenarios using the Arrow format, and carbondata can already write
to Arrow vectors from the SDK (#3193). So integration is easy, and we just
have to add petastorm as a dependency of the carbondata project.
* I suggest we take the latest version of petastorm now [v0.77]*
* We can have a design document describing how this is done and all the
interfaces we support from pycarbon.*

Thanks,
Ajantha

On Sun, Nov 24, 2019 at 11:17 AM Jacky Li <[hidden email]> wrote:

> +1
>
> Great proposal, thanks for contributing
>
> Regards,
> Jacky