More and more people use big data to optimize their algorithms, train their models, deploy those models as services, and run inference on images. Storing, managing, and analyzing large amounts of structured and unstructured data is a big challenge, especially for unstructured data such as images, video, and audio. Many users build their projects for these scenarios in Python.

Apache CarbonData is an indexed columnar data store for fast analytics on big data platforms. It has many useful features and delivers high performance for storing, managing, and analyzing big data. CarbonData already supports the String, Int, Double, Boolean, Char, Date, and Timestamp data types, and it also supports Binary (CARBONDATA-3336). The Binary type avoids the small binary files problem, can speed up S3 access by dozens or even hundreds of times, and lowers the cost of accessing OBS by reducing the number of S3 API calls.

However, it is not easy for these users to work with CarbonData through Java/Scala/C++, so it would be better to provide a Python interface that lets them use CarbonData from Python code. We have already been working on this feature for several months in https://github.com/xubo245/pycarbon

*Goals:*

1. Apache CarbonData should provide a Python interface to write and read structured and unstructured data in CarbonData, such as String, int, and binary data (image/voice/video). It should not depend on Apache Spark.
2. Apache CarbonData should provide a Python interface that lets deep learning frameworks such as TensorFlow, MXNet, and PyTorch read and write data from/to CarbonData. It should not depend on Apache Spark.
3. Apache CarbonData should provide a Python interface to manage and analyze data based on Apache Spark. Apache CarbonData should support the DDL, DML, and DataMap features in Python.

*Details:*

*1. Apache CarbonData should provide a Python interface to write and read structured and unstructured data in CarbonData, such as String, int, and binary data (image/voice/video). It should not depend on Apache Spark.*

Apache CarbonData already provides Java/Scala/C++ interfaces, and more and more people use Python to manage and analyze big data, so it would be better to provide a Python interface for writing and reading structured and unstructured data in CarbonData as well. We call it PYSDK.

PYSDK is based on the CarbonData Java SDK and uses pyjnius to call Java code from Python. Apache Spark uses py4j in PySpark to call Java code from Python, but py4j performs poorly when reading big data in CarbonData format from Python code. The py4j documentation also reports low performance when reading large data: https://www.py4j.org/advanced_topics.html#performance. JPype is another popular tool for calling Java code from Python, but it stopped being updated several years ago, so we cannot use it. In our tests, pyjnius showed high performance when reading big data by calling Java code from Python, so it is a good choice for us.

Goals:
1). PYSDK should provide an interface to read data
2). PYSDK should provide an interface to write data
3). PYSDK should support the basic data types
4). PYSDK should support projection
5). PYSDK should support filter

*2. Apache CarbonData should provide a Python interface that lets deep learning frameworks such as TensorFlow, MXNet, and PyTorch read and write data from/to CarbonData. It should not depend on Apache Spark.*

Goals:
1). CarbonData provides a Python interface that lets TensorFlow read data from CarbonData for model training
2). CarbonData provides a Python interface that lets MXNet read data from CarbonData for model training
3). CarbonData provides a Python interface that lets PyTorch read data from CarbonData for model training
4). CarbonData should support the epoch function
5). CarbonData should support a cache to speed up reads

*3. Apache CarbonData should provide a Python interface to manage and analyze data based on Apache Spark. Apache CarbonData should support the DDL, DML, and DataMap features in Python.*

Goals:
1). PyCarbon supports reading data from local/HDFS/S3 in Python code through a PySpark DataFrame
2). PyCarbon supports writing data from Python code to local/HDFS/S3 through a PySpark DataFrame
3). PyCarbon supports DDL in Python in SQL format
4). PyCarbon supports DML in Python in SQL format
5). PyCarbon supports DataMap in Python in SQL format

The JIRA is: https://issues.apache.org/jira/browse/CARBONDATA-3254

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
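
To make the PYSDK goals "projection" and "filter" from the proposal above concrete, here is a minimal sketch of what the two operations mean for a reader. It is only an illustration under assumptions: `InMemoryReader`, its method names, and the sample rows are hypothetical, and the code runs against an in-memory list rather than CarbonData files or the pycarbon API.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Sequence

Row = Dict[str, Any]


class InMemoryReader:
    """Hypothetical stand-in for a PYSDK reader; not the pycarbon API."""

    def __init__(self, rows: Sequence[Row]) -> None:
        self._rows = list(rows)
        self._columns: Optional[List[str]] = None
        self._predicate: Optional[Callable[[Row], bool]] = None

    def projection(self, columns: Sequence[str]) -> "InMemoryReader":
        # Projection: keep only the named columns in every returned row.
        self._columns = list(columns)
        return self

    def filter(self, predicate: Callable[[Row], bool]) -> "InMemoryReader":
        # Filter: keep only rows for which the predicate is true.
        self._predicate = predicate
        return self

    def read(self) -> Iterator[Row]:
        for row in self._rows:
            # The predicate sees the full row, so it may test columns
            # that are not part of the projection.
            if self._predicate is not None and not self._predicate(row):
                continue
            if self._columns is None:
                yield dict(row)
            else:
                yield {name: row[name] for name in self._columns}


# Assumed sample data: a name, a label, and raw image bytes per row.
rows = [
    {"name": "cat.jpg", "label": 0, "image": b"\xff\xd8"},
    {"name": "dog.jpg", "label": 1, "image": b"\xff\xd8\xff"},
    {"name": "owl.jpg", "label": 1, "image": b"\xff"},
]

reader = InMemoryReader(rows).projection(["name", "image"]).filter(
    lambda row: row["label"] == 1
)
result = list(reader.read())
```

In this sketch the filter is evaluated before the projection is applied, which is why it can select on `label` even though `label` is not returned. A columnar store can go further and skip reading unprojected columns entirely, which a list of dictionaries cannot show.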
+1
Great proposal, thanks for contributing.

Regards,
Jacky
+1,

As we have already worked on it, we have to integrate it as cleanly as possible. I think this can be done in 2 layers.

1. *PySDK:* A generic Python layer over the Java SDK. Users who don't need AI support but just want a Python SDK layer can use this alone.
a. It supports reading and writing carbondata files (like the Java SDK). *We can have a document listing all the APIs we support.*
b. This layer also supports building the Arrow carbon reader, which is supported by the Java SDK. Here we read carbon files and fill the in-memory Arrow vector. This is used by the PyCarbon layer.

2. *PyCarbon:* This layer will be responsible for integrating CarbonData with AI engines such as TensorFlow, MXNet, and PyTorch to support AI scenarios like epoch and shuffle.
*Uber's petastorm (an open source project under the Apache license)* supports all of the above scenarios using the Arrow format, and CarbonData can already write to an Arrow vector from the SDK (#3193). So integration is easy; we just have to add petastorm as a dependency of the carbondata project.
*I suggest we take the latest version of petastorm now [v0.77].*
*We can have a design document describing how this is done and which interfaces we support from PyCarbon.*

Thanks,
Ajantha

On Sun, Nov 24, 2019 at 11:17 AM Jacky Li <[hidden email]> wrote:
> +1
>
> Great proposal, thanks for contributing
>
> Regards,
> Jacky
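
The "epoch and shuffle" scenarios mentioned for the PyCarbon layer can be sketched in a few lines. This is only an illustration under assumptions: `iterate_epochs`, its parameters, and the sample names are hypothetical, the data is an in-memory list, and nothing here is the petastorm or pycarbon API.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iterate_epochs(
    rows: Sequence[T],
    num_epochs: int,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[Tuple[int, T]]:
    """Yield (epoch, row) pairs; each epoch is one full pass over rows."""
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        order: List[T] = list(rows)
        if shuffle:
            # Reorder on every epoch so training does not see a fixed sequence.
            rng.shuffle(order)
        for row in order:
            yield epoch, row


# Assumed sample data standing in for rows read from a dataset.
samples = ["img0", "img1", "img2", "img3"]
passes = list(iterate_epochs(samples, num_epochs=2, shuffle=True, seed=7))
```

Every epoch yields each sample exactly once, so `passes` holds eight pairs here. Passing a `seed` makes the shuffled order reproducible between runs, which helps when comparing training results.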