[GitHub] carbondata pull request #1297: [WIP] Added a value based compression for dec...

GitHub user manishgupta88 opened a pull request:

    https://github.com/apache/carbondata/pull/1297

    [WIP] Added a value based compression for decimal data type when decimal is stored as Int or Long

    Added a value based compression for decimal data type when decimal is stored as Int or Long
   
    1. When decimal precision is <= 9, decimal values are stored in 4 bytes, but unlike other primitive data types they are not further compressed based on their min and max values. Now, based on the min and max values, decimal data falling in the Integer range will be further compressed to byte or short.
    2. When decimal precision is <= 18, decimal values are stored in 8 bytes, but unlike other primitive data types they are not further compressed based on their min and max values. Now, based on the min and max values, decimal data falling in the Long range will be further compressed to byte, short or int.
   
    Advantage: This will reduce the storage space, thereby decreasing the IO time when decompressing the data.
   
    TODO: Needs to be implemented for the unsafe type.
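The min/max-based type selection described above can be sketched as follows. This is an illustrative standalone version: the real logic lives in CarbonData's `DefaultEncodingStrategy` (e.g. `fitLongMinMax`), and the class and enum names below are simplified stand-ins, not the actual API.

```java
// Illustrative sketch of value-based compression selection: given the min and
// max of the unscaled decimal values in a page, pick the smallest integral
// storage type that can hold every value.
public class DecimalFit {

  enum StorageType { BYTE, SHORT, INT, LONG }

  // Returns the narrowest type whose range covers [min, max].
  static StorageType fitLongMinMax(long max, long min) {
    if (max <= Byte.MAX_VALUE && min >= Byte.MIN_VALUE) {
      return StorageType.BYTE;
    } else if (max <= Short.MAX_VALUE && min >= Short.MIN_VALUE) {
      return StorageType.SHORT;
    } else if (max <= Integer.MAX_VALUE && min >= Integer.MIN_VALUE) {
      return StorageType.INT;
    } else {
      return StorageType.LONG;
    }
  }

  public static void main(String[] args) {
    // A page of precision <= 18 decimals whose unscaled values fit in a byte
    // can be stored in 1 byte per value instead of 8.
    System.out.println(fitLongMinMax(100L, -100L)); // BYTE
    System.out.println(fitLongMinMax(40000L, 0L));  // INT
  }
}
```
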


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/carbondata decimal_compression

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1297
   
----
commit 04d47802fa139f9a359ca6727f4f79a26402b43a
Author: manishgupta88 <[hidden email]>
Date:   2017-08-24T07:13:58Z

    Added a value based compression for decimal data type.
   
    1. When decimal precision is <= 9, decimal values are stored in 4 bytes, but unlike other primitive data types they are not further compressed based on their min and max values. Now, based on the min and max values, decimal data falling in the Integer range will be further compressed to byte or short.
    2. When decimal precision is <= 18, decimal values are stored in 8 bytes, but unlike other primitive data types they are not further compressed based on their min and max values. Now, based on the min and max values, decimal data falling in the Long range will be further compressed to byte, short or int.
   
    Advantage: This will reduce the storage space, thereby decreasing the IO time when decompressing the data.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/304/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/308/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/328/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/330/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user manishgupta88 commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    ok to test


[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    SDV Build Failed with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/340/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3431/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user manishgupta88 commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    ok to test


[GitHub] carbondata pull request #1297: [CARBONDATA-1429] Add a value based compressi...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1297#discussion_r136248243
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/SafeFixLengthColumnPage.java ---
    @@ -165,9 +165,28 @@ public double getDouble(int rowId) {
         return doubleData[rowId];
       }
     
    -  @Override
    -  public BigDecimal getDecimal(int rowId) {
    -    throw new UnsupportedOperationException("invalid data type: " + dataType);
    +  @Override public BigDecimal getDecimal(int rowId) {
    --- End diff --
   
    I think it is better to add a DecimalColumnPage instead of using FixLengthColumnPage


[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/3432/



[GitHub] carbondata pull request #1297: [CARBONDATA-1429] Add a value based compressi...
Github user manishgupta88 commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1297#discussion_r136260340
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/SafeFixLengthColumnPage.java ---
    @@ -165,9 +165,28 @@ public double getDouble(int rowId) {
         return doubleData[rowId];
       }
     
    -  @Override
    -  public BigDecimal getDecimal(int rowId) {
    -    throw new UnsupportedOperationException("invalid data type: " + dataType);
    +  @Override public BigDecimal getDecimal(int rowId) {
    --- End diff --
   
    Please correct me if I am wrong... In my opinion there would be no advantage to adding a DecimalColumnPage, since we would not be doing anything extra by introducing it.
    Basically we store values as either fixed or variable length, and decimal values also fall into these two categories. So under the current design I believe one more page type is not required unless we have separate pages for all the other data types as well.
    Please share your suggestions.


[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    SDV Build Success with Spark 2.1, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/348/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user ashwini-krishnakumar commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    @ravipesala please review


[GitHub] carbondata pull request #1297: [CARBONDATA-1429] Add a value based compressi...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1297#discussion_r137270300
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/encoding/DefaultEncodingStrategy.java ---
    @@ -104,18 +107,31 @@ private ColumnPageEncoder createEncoderForMeasure(ColumnPage columnPage) {
           case SHORT:
           case INT:
           case LONG:
    -        return selectCodecByAlgorithmForIntegral(stats).createEncoder(null);
    +        return selectCodecByAlgorithmForIntegral(stats,
    +            DecimalConverterFactory.DecimalConverterType.DECIMAL_LONG).createEncoder(null);
    +      case DECIMAL:
    +        return createEncoderForDecimalDataTypeMeasure(columnPage);
    --- End diff --
   
    Rename this like the others: `selectCodecByAlgorithmForDecimal`.


[GitHub] carbondata pull request #1297: [CARBONDATA-1429] Add a value based compressi...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1297#discussion_r137270799
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/encoding/DefaultEncodingStrategy.java ---
    @@ -130,7 +146,8 @@ private static DataType fitLongMinMax(long max, long min) {
         }
       }
     
    -  private static DataType fitMinMax(DataType dataType, Object max, Object min) {
    +  private static DataType fitMinMax(DataType dataType, Object max, Object min,
    +      DecimalConverterFactory.DecimalConverterType decimalConverterType) {
    --- End diff --
   
    I think it is not good to pass `decimalConverterType` through many functions; it makes the code complex. It would be better to find a way to encapsulate it.


[GitHub] carbondata pull request #1297: [CARBONDATA-1429] Add a value based compressi...
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1297#discussion_r137273402
 
    --- Diff: core/src/main/java/org/apache/carbondata/core/datastore/page/SafeFixLengthColumnPage.java ---
    @@ -165,9 +165,28 @@ public double getDouble(int rowId) {
         return doubleData[rowId];
       }
     
    -  @Override
    -  public BigDecimal getDecimal(int rowId) {
    -    throw new UnsupportedOperationException("invalid data type: " + dataType);
    +  @Override public BigDecimal getDecimal(int rowId) {
    --- End diff --
   
    I think adding the converter inside the column page makes it a bit complex. I think there are two ways to make the code clean:
    1. Store a BigDecimal array in the ColumnPage and do the conversion to int or long when it comes to the encoding part, by implementing a ColumnPageValueConverter.
    2. Create a DecimalColumnPage class and put the conversion logic there, since this logic is extra logic for decimal only. If you do it this way, we can consider renaming `FixLengthColumnPage` to `PrimitiveColumnPage` and `VarLengthColumnPage` to `StringColumnPage`, with `DecimalColumnPage` as the third one. The responsibility of each class will be clearer.
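For context on the conversion being discussed, here is a minimal sketch of how a fixed-scale decimal can round-trip through an integral representation. This is illustrative only; the class and method names are hypothetical and not CarbonData's actual `DecimalConverterFactory` API.

```java
import java.math.BigDecimal;

// Illustrative sketch: a decimal with precision <= 18 and a known, fixed scale
// can be stored losslessly as its unscaled long value, then restored with the
// same scale on read.
public class DecimalUnscaledSketch {

  // e.g. 12.34 at scale 2 -> unscaled value 1234 (throws if the value
  // does not fit the given scale exactly or overflows a long)
  static long toUnscaledLong(BigDecimal value, int scale) {
    return value.setScale(scale).unscaledValue().longValueExact();
  }

  // e.g. unscaled 1234 at scale 2 -> 12.34
  static BigDecimal fromUnscaledLong(long unscaled, int scale) {
    return BigDecimal.valueOf(unscaled, scale);
  }

  public static void main(String[] args) {
    long stored = toUnscaledLong(new BigDecimal("12.34"), 2);
    System.out.println(stored);                      // 1234
    System.out.println(fromUnscaledLong(stored, 2)); // 12.34
  }
}
```

The unscaled long is what the page-level min/max compression would then shrink to byte, short or int.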


[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    Build Success with Spark 1.6, Please check CI http://144.76.159.231:8080/job/ApacheCarbonPRBuilder/828/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/16/



[GitHub] carbondata issue #1297: [CARBONDATA-1429] Add a value based compression for ...
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1297
 
    Build Failed  with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/113/


