[GitHub] [carbondata] marchpure opened a new pull request #3607: [CARBONDATA-3670] Support compress offheap columnpage directly, avoding a copy of data from offhead to heap when compressed.

classic Classic list List threaded Threaded
25 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure opened a new pull request #3607: [CARBONDATA-3670] Support compress offheap columnpage directly, avoding a copy of data from offhead to heap when compressed.

GitBox
marchpure opened a new pull request #3607: [CARBONDATA-3670] Support compress offheap columnpage directly, avoding a copy of data from offhead to heap when compressed.
URL: https://github.com/apache/carbondata/pull/3607
 
 
    ### Why is this PR needed?
     When loading, the columnpages are stored on the offheap by default,  compression is needed to save storage cost. But, in the compression, the data must be copied from the offheap to the heap before compressed, leads to heavier GC overhead compared with compress offhead data directly.
     Overall, this pr aims to support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap when compressed.
   
    ### What changes were proposed in this PR?
     1. Support compress direct bytebuffer in the SNAPPY/ZSTD/GZIP compressor
        Add Interface compressByte(ByteBuffer) in the Compressor/SnappyCompressor/ZstdCompressor/GzipCompressor.java
     2. Support compress offheap data directly in the columnpage if the dataype is primitive
        2.1 Add Interface getPage in columnpage to get data as directbytebuffer
        2.2 The compress() in the Columnpage.java is changed. If the datatype is primitve and the page is unsafe, compress the directbytebuffer returned by getPage() directly.
     3. Support compress offheap data directly in the columnpage in IndexStorageCodec
        3.1 For String/Varchar, the RLE and InvertIndex needs to get the columnpage as 2-dimension bytearray, in which each bytearray presents a row, We add a interface getByteBufferArray()
            in the Columnpage, to replace the 2-dimension bytearray. Then, InvertIndex and RLE can work on the directbytebuffer directly.
        3.2 If there are no need to build RLE and InvertIndex, getByteBufferArray() return the flatten data as directbytebuffer, which can be compressed directly.
       
    ### Does this PR introduce any user interface change?
    - No
   
    ### Is any new testcase added?
    - Yes
   
       
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583794215
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/188/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583797146
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1890/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583827606
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583829549
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/191/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583835072
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1893/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839095
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839247
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1896/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839556
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/195/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839686
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1897/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840028
 
 
   Build Failed  with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/196/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840176
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840188
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1898/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840206
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583842184
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/197/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583848464
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1899/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376868515
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/columnar/BlockIndexerStorageForNoInvertedIndexForShort.java
 ##########
 @@ -17,52 +17,55 @@
 
 package org.apache.carbondata.core.datastore.columnar;
 
+import java.nio.ByteBuffer;
 import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.List;
 
 import org.apache.carbondata.core.constants.CarbonCommonConstants;
-import org.apache.carbondata.core.util.ByteUtil;
 
 /**
  * Below class will be used to for no inverted index
  */
-public class BlockIndexerStorageForNoInvertedIndexForShort extends BlockIndexerStorage<byte[][]> {
+public class BlockIndexerStorageForNoInvertedIndexForShort
+    extends BlockIndexerStorage<ByteBuffer[]> {
 
   /**
    * column data
    */
-  private byte[][] dataPage;
+  private ByteBuffer[] dataPage;
 
   private short[] dataRlePage;
 
-  public BlockIndexerStorageForNoInvertedIndexForShort(byte[][] dataPage, boolean applyRLE) {
+  public BlockIndexerStorageForNoInvertedIndexForShort(ByteBuffer[] dataPage, boolean applyRLE) {
     this.dataPage = dataPage;
     if (applyRLE) {
-      List<byte[]> actualDataList = new ArrayList<>();
-      for (int i = 0; i < dataPage.length; i++) {
-        actualDataList.add(dataPage[i]);
-      }
+      List<ByteBuffer> actualDataList = Arrays.asList(dataPage);
 
 Review comment:
   **Can we skip converting arrays to list ?**
   Can we change it to use the array directly ? because as it is one dimensional array now, we can remove list. can use array directly in below methods.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376869338
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/columnar/BlockIndexerStorageForNoInvertedIndexForShort.java
 ##########
 @@ -79,12 +82,8 @@ private void rleEncodeOnData(List<byte[]> actualDataList) {
     }
   }
 
-  private byte[][] convertToDataPage(List<byte[]> list) {
-    byte[][] shortArray = new byte[list.size()][];
-    for (int i = 0; i < shortArray.length; i++) {
-      shortArray[i] = list.get(i);
-    }
-    return shortArray;
+  private ByteBuffer[] convertToDataPage(List<ByteBuffer> list) {
 
 Review comment:
   should avoid redundant conversion,  should directly use ByteBuffer[] everywhere, don't convert to list

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376871815
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java
 ##########
 @@ -747,6 +759,16 @@ public long getPageLengthInBytes() throws IOException {
    */
   public byte[] compress(Compressor compressor) throws IOException {
     DataType dataType = columnPageEncoderMeta.getStoreDataType();
+
+    // if the columnpage is isUnsafeEnabled and the Datatype is primitive.
+    // we try to compress the data in offheap directly, avoiding a copy from offheap to heap
+    if (isUnsafeEnabled() && (dataType == DataTypes.BOOLEAN || dataType == BYTE
+        || dataType == SHORT || dataType == DataTypes.SHORT_INT || dataType == INT
+        || dataType == LONG || dataType == FLOAT || dataType == DOUBLE
+        || DataTypes.isDecimal(dataType))) {
 
 Review comment:
   is Decimal supported ?
   
   below I see getByteBufferArrayPage is unsupported in DecimalColumnPage

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoding a copy of data from offhead to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376872535
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/datastore/page/SafeFixLengthColumnPage.java
 ##########
 @@ -289,6 +298,15 @@ public BigDecimal getDecimal(int rowId) {
     return data;
   }
 
+  @Override
+  public ByteBuffer[] getByteBufferArrayPage(boolean isFlattened) {
 
 Review comment:
   The changes was only for offheap data right ? so I expect only unsafe pages should have changes. Why changed for safe column pages also ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
12