[Discuss] CarbonData supports binary data type

CarbonData supports binary data type



Version | Changes                                      | Owner | Date
0.1     | Initial doc for supporting binary data type  | Xubo  | 2019-04-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a
dataset contains many small binary objects. The majority of application
scenarios involve storing small binary values in CarbonData, which avoids
the small-files problem, speeds up S3 access, and reduces the cost of
accessing OBS by decreasing the number of S3 API calls. Storing structured
and unstructured (binary) data together in CarbonData also makes them
easier to manage.

Goals:
1. Support writing the binary data type through the Carbon Java SDK.
2. Support reading the binary data type through the Spark carbon file
format (carbon datasource) and CarbonSession.
3. Support reading the binary data type through the Carbon SDK.
4. Support writing binary data through Spark.


Approach and Detail:
        1. Supporting writing the binary data type via the Carbon Java SDK [Formal]:
            1.1 The Java SDK needs to support writing data with its original
data types, such as int, double and byte[], instead of converting every
value to a string array. The user reads a binary file as byte[], and the
SDK writes the byte[] into the binary column (see the sketch after this
list).
            1.2 CarbonData currently compresses the binary column, because
the compressor works at table level.
                => TODO: support a configuration option for compression. The
default should be no compression, because binary data is usually already
compressed (e.g. JPEG images), so there is no benefit in compressing the
binary column again. Version 1.5.4 will support column-level compression;
after that, we can implement no-compression for binary and discuss it with
the community.
            1.3 CarbonData stores binary as a dimension.
            1.4 Support configuring the page size for the binary data type,
because binary values are usually large (around 200 KB each); otherwise a
single blocklet (32,000 rows) would become very big.
                TODO: 1.5 Avro and JSON conversion need to be considered.
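
To make 1.1 concrete, below is a minimal sketch of the proposed SDK write
path (written in Scala for consistency with the later Spark sketches).
DataTypes.BINARY and passing a row as Object[] containing a raw byte[] are
assumptions of this proposal, not existing API; paths and names are
illustrative.

    import java.nio.file.{Files, Paths}

    import org.apache.carbondata.core.metadata.datatype.DataTypes
    import org.apache.carbondata.sdk.file.{CarbonWriter, Field, Schema}

    object BinaryWriteSketch {
      def main(args: Array[String]): Unit = {
        // Proposed BINARY type; hypothetical until this proposal lands.
        val fields = Array(
          new Field("id", DataTypes.INT),
          new Field("image", DataTypes.BINARY))

        // The user reads the binary file as byte[] ...
        val image: Array[Byte] =
          Files.readAllBytes(Paths.get("/tmp/sample.jpg"))

        val writer = CarbonWriter.builder()
          .outputPath("/tmp/binary_table")
          .withCsvInput(new Schema(fields))
          .build()

        // ... and the SDK writes the byte[] straight into the binary
        // column, with no string round-trip (proposed behavior).
        writer.write(Array[AnyRef](Int.box(1), image))
        writer.close()
      }
    }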
               

        2. Supporting reading and managing the binary data type through the
Spark carbon file format (carbon datasource) and CarbonSession. [Formal]
            2.1 Support reading the binary data type from a non-transactional
table: read the binary column and return it as byte[].
            2.2 Support creating a table with a binary column; the table
properties sort_columns, dictionary, COLUMN_META_CACHE and RANGE_COLUMN are
not supported for binary columns (see the sketch after this section).
                => Evaluate COLUMN_META_CACHE for binary.
                => carbon.column.compressor applies to all columns.
            2.3 Support CTAS for binary => transactional/non-transactional.
            2.4 Support external tables for binary.
            2.5 Support projection of binary columns.
            2.6 Support SHOW TABLES, DESC and ALTER TABLE for the binary
data type.
            2.7 Don't support PARTITION, filter, or BUCKETCOLUMNS for binary.
            2.8 Support compaction for binary.
            2.9 DataMaps? Don't support the bloomfilter, lucene or
timeseries datamaps; no need for a min/max datamap for binary; support MV
and pre-aggregate in the future.
            2.10 The CSDK / Python SDK will support binary in the future.
            2.11 Support S3.
         
CarbonSession: impact analysis
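
A sketch of how 2.1, 2.2 and 2.5 might look from the user's side, assuming
the carbon datasource accepts BINARY in DDL once this proposal is
implemented (table name and paths are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("CarbonBinarySketch")
      .getOrCreate()

    // Proposed DDL: BINARY accepted as a column type; sort_columns,
    // dictionary, COLUMN_META_CACHE and RANGE_COLUMN must not reference it.
    spark.sql(
      """CREATE TABLE binary_table (id INT, image BINARY)
        |USING carbon""".stripMargin)

    // 2.5: projection on the binary column; values come back as Array[Byte].
    spark.sql("SELECT id, image FROM binary_table").show()

    // 2.1: reading a non-transactional carbon output path directly.
    val df = spark.read.format("carbon").load("/tmp/binary_table")
    df.select("image").printSchema()   // image: binary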
       

        3. Supporting reading the binary data type via the Carbon SDK
            3.1 Support reading the binary data type from a non-transactional
table: read the binary column and return it as byte[] (see the sketch after
this list).
            3.2 Support projection of binary columns.
            3.3 Support S3.
            3.4 No need to support filters.
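
A minimal sketch of the SDK read path for 3.1 and 3.2. CarbonReader.builder
and projection already exist; returning the binary column as byte[] is the
proposed part, and the path/table name are illustrative:

    import org.apache.carbondata.sdk.file.CarbonReader

    // 3.1/3.2: read the binary column back with a projection.
    val reader: CarbonReader[AnyRef] =
      CarbonReader.builder("/tmp/binary_table", "_temp")
        .projection(Array("id", "image"))
        .build()

    while (reader.hasNext) {
      val row = reader.readNextRow.asInstanceOf[Array[AnyRef]]
      val image = row(1).asInstanceOf[Array[Byte]] // proposed: binary as byte[]
      println(s"id=${row(0)}, image bytes=${image.length}")
    }
    reader.close()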

        4. Supporting writing binary data via Spark (carbon file format /
CarbonSession, POC??)
            4.1 Convert the binary data to a string (encoded as Hex or
Base64) and store it in CSV.
            4.2 Spark loads the CSV and converts the string back to binary,
then stores it in CarbonData; CarbonData internally decodes the Hex into
binary (see the sketch after this list).
            4.3 Support INSERT (string => binary, with a configuration for
the encode/decode algorithm; the default is Hex, and the user can change it
to Base64 or another codec. Is that acceptable?), UPDATE and DELETE for
binary.
            4.4 Don't support streaming tables.
                => Refer to Hive and the Spark 2.4 image DataSource.
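
For 4.1/4.2, a sketch of the hex staging path, assuming Hex as the default
codec (DatatypeConverter is just one way to hex-encode; the codec choice is
the open question above). It reuses the `spark` session and `binary_table`
from the earlier sketch, and LOAD DATA assumes a CarbonSession-managed
table:

    import java.nio.file.{Files, Paths}
    import javax.xml.bind.DatatypeConverter

    // 4.1: encode the raw bytes as a hex string so they survive CSV.
    val image = Files.readAllBytes(Paths.get("/tmp/sample.jpg"))
    val hex = DatatypeConverter.printHexBinary(image)
    Files.write(Paths.get("/tmp/staging.csv"), s"1,$hex".getBytes("UTF-8"))

    // 4.2: load the CSV; CarbonData would decode the hex string back into
    // the binary column internally (proposed behavior).
    spark.sql(
      """LOAD DATA INPATH '/tmp/staging.csv'
        |INTO TABLE binary_table""".stripMargin)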

Formal? How should SQL support writing binary data read from image files?
Using Spark core code is acceptable.
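
Regarding that question: Spark 2.4's built-in image datasource already
exposes image bytes as a BinaryType struct field, so SQL writes could build
on it instead of new Spark core code. A sketch, reusing the `spark` session
from above (the directory is illustrative):

    // Spark 2.4 image source: each row carries an `image` struct whose
    // `data` field holds the raw bytes (BinaryType).
    val images = spark.read.format("image").load("/tmp/images/")
    images.select("image.origin", "image.data").printSchema()
    // image.data could then be inserted into a carbon binary column.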


       Please reply to this mail if you have any better suggestions! Thanks,
                                   xubo

 
JIRA: https://issues.apache.org/jira/browse/CARBONDATA-3336




