[jira] [Updated] (CARBONDATA-431) Analysis compression for numeric datatype compared with Parquet/ORC


Akash R Nilugal (Jira)

     [ https://issues.apache.org/jira/browse/CARBONDATA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

suo tong updated CARBONDATA-431:
--------------------------------
    Description:
Broken down by data type, Carbon's string type has a better compression ratio than Parquet and ORC, but for numeric types ORC has the best compression. We should analyze how Carbon handles numeric data types in order to get a better compression ratio.

DataType                  | Text | Parquet | Orc | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode)
                          |      |         |     | 1G (dictionary encode without inverted index)
                          |      |         |     | 3G (no dictionary encode)
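
To make the comparison easier to read, the sizes in the table above can be turned into compression ratios. The following is only a rough sketch: it assumes the Text column is the uncompressed baseline, uses the rounded whole-GB figures as reported, and takes the 1G dictionary-encoded figure for low-cardinality strings in Carbon. The names `SIZES_GB`, `compression_ratios`, and `best_formats` are hypothetical helpers and not part of CarbonData.

```python
# Sizes in GB, copied from the table above.
# Assumption: "Text" is treated as the uncompressed baseline.
SIZES_GB = {
    "decimal": {"Text": 16, "Parquet": 11, "Orc": 6, "Carbon": 13},
    "int": {"Text": 5, "Parquet": 1, "Orc": 1, "Carbon": 3},
    "String (high cardinality)": {"Text": 24, "Parquet": 22, "Orc": 11, "Carbon": 3},
    # Assumption: Carbon figure is the dictionary-encoded case (1G).
    "String (low cardinality)": {"Text": 30, "Parquet": 4, "Orc": 4, "Carbon": 1},
}


def compression_ratios(sizes):
    """Return {datatype: {format: text_size / format_size}} for non-Text formats."""
    return {
        dtype: {fmt: row["Text"] / size for fmt, size in row.items() if fmt != "Text"}
        for dtype, row in sizes.items()
    }


def best_formats(sizes):
    """Return {datatype: sorted list of formats sharing the smallest size}."""
    result = {}
    for dtype, row in sizes.items():
        candidates = {fmt: size for fmt, size in row.items() if fmt != "Text"}
        smallest = min(candidates.values())
        result[dtype] = sorted(f for f, s in candidates.items() if s == smallest)
    return result


if __name__ == "__main__":
    ratios = compression_ratios(SIZES_GB)
    best = best_formats(SIZES_GB)
    for dtype, row in ratios.items():
        cells = ", ".join(f"{fmt}: {ratio:.1f}x" for fmt, ratio in row.items())
        print(f"{dtype}: {cells} (smallest: {', '.join(best[dtype])})")
```

Because the inputs are rounded to whole gigabytes, the ratios are approximate, but they show the gap this issue is about: Carbon is the smallest for both string cases, while it trails ORC on decimal and trails both ORC and Parquet on int.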


  was:
Broken down by data type, Carbon's string type has a better compression ratio than Parquet and ORC, but for numeric types ORC has the best compression. We should analyze how Carbon handles numeric data types in order to get a better compression ratio.

DataType                  | Text | Parquet | Orc | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode)
                          |      |         |     | 1G (dictionary encode without inverted index)
                          |      |         |     | 3G (no dictionary encode)



> Analysis compression for numeric datatype compared with Parquet/ORC
> -------------------------------------------------------------------
>
>                 Key: CARBONDATA-431
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-431
>             Project: CarbonData
>          Issue Type: Sub-task
>            Reporter: suo tong
>            Assignee: Jacky Li
>
> Broken down by data type, Carbon's string type has a better compression ratio than Parquet and ORC, but for numeric types ORC has the best compression. We should analyze how Carbon handles numeric data types in order to get a better compression ratio.
>
> DataType                  | Text | Parquet | Orc | Carbon
> decimal                   | 16G  | 11G     | 6G  | 13G
> int                       | 5G   | 1G      | 1G  | 3G
> String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
> String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode)
>                           |      |         |     | 1G (dictionary encode without inverted index)
>                           |      |         |     | 3G (no dictionary encode)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)