Apache CarbonData Dev Mailing List archive - Re: [DISCUSSION] PyCarbon: provide python interface for users to use CarbonData by python code

Posted by Ajantha Bhat on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/DISCUSSION-PyCarbon-provide-python-interface-for-users-to-use-CarbonData-by-python-code-tp87268p87368.html

+1,

Since we have already worked on this, we should integrate it as cleanly as
possible.

I think this can be done in two layers.

1. *PySDK:* A generic Python layer over the Java SDK. Users who don't need
AI support and only want a Python SDK layer can use just this.
a. It supports reading and writing carbondata files (like the Java SDK).
*We can have a document listing all the APIs we support.*
b. It also supports building the Arrow carbon reader that the Java SDK
supports. Here we read carbon files and fill the in-memory Arrow vector.
This is used by the PyCarbon layer.

2. *PyCarbon:* This layer will be responsible for integrating carbondata
with AI engines like TensorFlow, MXNet, and PyTorch to provide AI scenarios
like epoch and shuffle.
*Uber's petastorm (an open source, Apache-licensed project)* supports all
of the above scenarios using the Arrow format, and carbondata can already
write to an Arrow vector from the SDK (#3193). So the integration is easy;
we just have to add a petastorm dependency to the carbondata project.
* I suggest we take the latest version of petastorm now [v0.77]*
* We can have a design document describing how this is done and which
interfaces we support from pycarbon.*
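
To make the two-layer split concrete, here is a minimal sketch in plain
Python. It only illustrates the layering and the epoch/shuffle behaviour.
All names in it (InMemoryReader, EpochLoader, read_all) are hypothetical and
are not the PySDK, PyCarbon, or petastorm API; the reader is an in-memory
stand-in for what would really wrap the Java SDK's carbon reader.

```python
import random
from typing import Dict, Iterator, List, Optional

Row = Dict[str, object]


class InMemoryReader:
    """Hypothetical stand-in for the lower (PySDK-like) layer.

    A real reader would wrap the Java SDK and read carbondata files;
    this one just serves rows held in memory.
    """

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)

    def read_all(self) -> List[Row]:
        # Return a copy so callers can reorder it without side effects.
        return list(self._rows)


class EpochLoader:
    """Hypothetical stand-in for the upper (PyCarbon-like) layer.

    Replays the underlying reader for a number of epochs and
    optionally shuffles the rows within each epoch.
    """

    def __init__(
        self,
        reader: InMemoryReader,
        num_epochs: int = 1,
        shuffle: bool = False,
        seed: Optional[int] = None,
    ) -> None:
        if num_epochs < 1:
            raise ValueError("num_epochs must be at least 1")
        self._reader = reader
        self._num_epochs = num_epochs
        self._shuffle = shuffle
        self._rng = random.Random(seed)

    def __iter__(self) -> Iterator[Row]:
        for _ in range(self._num_epochs):
            rows = self._reader.read_all()
            if self._shuffle:
                # Shuffle in place; each epoch gets its own ordering.
                self._rng.shuffle(rows)
            yield from rows


if __name__ == "__main__":
    reader = InMemoryReader([{"id": i} for i in range(5)])
    for row in EpochLoader(reader, num_epochs=2, shuffle=True, seed=0):
        print(row)
```

Keeping epoch and shuffle handling out of the reader is what lets users who
only need file read/write depend on the lower layer alone.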

Thanks,
Ajantha

On Sun, Nov 24, 2019 at 11:17 AM Jacky Li <[hidden email]> wrote:

> +1
>
> Great proposal, thanks for contributing
>
> Regards,
> Jacky