marchpure opened a new pull request #3607: [CARBONDATA-3670] Support compress offheap columnpage directly, avoiding a copy of data from offheap to heap when compressed.
URL: https://github.com/apache/carbondata/pull/3607

### Why is this PR needed?
During loading, column pages are stored offheap by default, and they are compressed to save storage cost. However, the data must currently be copied from offheap to the heap before it is compressed, which leads to heavier GC overhead than compressing the offheap data directly. This PR adds support for compressing offheap column page data directly, avoiding the copy from offheap to heap.

### What changes were proposed in this PR?
1. Support compressing a direct ByteBuffer in the SNAPPY/ZSTD/GZIP compressors.
   Add the interface compressByte(ByteBuffer) to Compressor/SnappyCompressor/ZstdCompressor/GzipCompressor.java.
2. Support compressing offheap data directly in the column page if the data type is primitive.
   2.1 Add the interface getPage to ColumnPage to get the data as a direct ByteBuffer.
   2.2 Change compress() in ColumnPage.java: if the data type is primitive and the page is unsafe, compress the direct ByteBuffer returned by getPage() directly.
3. Support compressing offheap data directly in the column page in IndexStorageCodec.
   3.1 For String/Varchar, RLE and InvertIndex need the column page as a two-dimensional byte array in which each byte array represents a row. We add an interface getByteBufferArray() to ColumnPage to replace the two-dimensional byte array, so that InvertIndex and RLE can work on the direct ByteBuffer directly.
   3.2 If there is no need to build RLE and InvertIndex, getByteBufferArray() returns the flattened data as a direct ByteBuffer, which can be compressed directly.

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- Yes

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

With regards,
Apache Git Services
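To illustrate item 1 of the proposal, here is a minimal sketch of compressing a direct ByteBuffer without first copying it into a heap byte[]. It is not the CarbonData implementation: it assumes Java 11 or later and uses the JDK's `java.util.zip.Deflater` as a stand-in for the SNAPPY/ZSTD/GZIP compressors. The class name `DirectBufferCompressSketch` and the `decompress` helper are hypothetical; only the method name `compressByte(ByteBuffer)` is taken from the PR description, and its real signature may differ.

```java
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DirectBufferCompressSketch {

  // Hypothetical stand-in for compressByte(ByteBuffer); uses JDK Deflater, not CarbonData code.
  public static ByteBuffer compressByte(ByteBuffer input) {
    Deflater deflater = new Deflater();
    try {
      // duplicate() shares the memory but leaves the caller's position and limit untouched
      deflater.setInput(input.duplicate());
      deflater.finish();
      ByteBuffer out = ByteBuffer.allocateDirect(input.remaining() + 64);
      while (!deflater.finished()) {
        if (!out.hasRemaining()) {
          // output buffer is full: move what we have into a larger direct buffer
          ByteBuffer bigger = ByteBuffer.allocateDirect(out.capacity() * 2);
          out.flip();
          bigger.put(out);
          out = bigger;
        }
        deflater.deflate(out);
      }
      out.flip();
      return out;
    } finally {
      deflater.end();
    }
  }

  // Hypothetical helper used only to check the round trip.
  public static ByteBuffer decompress(ByteBuffer compressed, int originalLength) {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed.duplicate());
      ByteBuffer out = ByteBuffer.allocateDirect(originalLength);
      while (out.hasRemaining() && !inflater.finished()) {
        if (inflater.inflate(out) == 0 && inflater.needsInput()) {
          break; // truncated input: stop instead of spinning
        }
      }
      out.flip();
      return out;
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt compressed data", e);
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) {
    // A small "column page" of ints held off the Java heap
    ByteBuffer page = ByteBuffer.allocateDirect(4 * 1024);
    for (int i = 0; i < 1024; i++) {
      page.putInt(i % 8);
    }
    page.flip();

    ByteBuffer compressed = compressByte(page);
    ByteBuffer restored = decompress(compressed, page.remaining());
    System.out.println("raw=" + page.remaining()
        + " compressed=" + compressed.remaining()
        + " roundTripOk=" + restored.equals(page));
  }
}
```

Both the input and the output stay in direct buffers here, so no intermediate heap array is allocated by this code; whether the underlying native library copies internally depends on the compressor.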
CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583794215

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/188/

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583797146

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1890/

marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583827606

retest this please

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583829549

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/191/

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583835072

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1893/

marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839095

retest this please

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839247

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1896/

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839556

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/195/

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583839686

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1897/

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840028

Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/196/

marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840176

retest this please

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840188

Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1898/

marchpure commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583840206

retest this please

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583842184

Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/197/

CarbonDataQA1 commented on issue #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#issuecomment-583848464

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1899/

ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376868515

##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/columnar/BlockIndexerStorageForNoInvertedIndexForShort.java
##########
@@ -17,52 +17,55 @@
 package org.apache.carbondata.core.datastore.columnar;
 
+import java.nio.ByteBuffer;
 import java.util.ArrayList;
+import java.util.Arrays;
 import java.util.List;
 
 import org.apache.carbondata.core.constants.CarbonCommonConstants;
-import org.apache.carbondata.core.util.ByteUtil;
 
 /**
  * Below class will be used to for no inverted index
  */
-public class BlockIndexerStorageForNoInvertedIndexForShort extends BlockIndexerStorage<byte[][]> {
+public class BlockIndexerStorageForNoInvertedIndexForShort
+    extends BlockIndexerStorage<ByteBuffer[]> {
 
   /**
    * column data
    */
-  private byte[][] dataPage;
+  private ByteBuffer[] dataPage;
 
   private short[] dataRlePage;
 
-  public BlockIndexerStorageForNoInvertedIndexForShort(byte[][] dataPage, boolean applyRLE) {
+  public BlockIndexerStorageForNoInvertedIndexForShort(ByteBuffer[] dataPage, boolean applyRLE) {
     this.dataPage = dataPage;
     if (applyRLE) {
-      List<byte[]> actualDataList = new ArrayList<>();
-      for (int i = 0; i < dataPage.length; i++) {
-        actualDataList.add(dataPage[i]);
-      }
+      List<ByteBuffer> actualDataList = Arrays.asList(dataPage);

Review comment:
**Can we skip converting the array to a list?**
Since it is now a one-dimensional array, we can drop the list and use the array directly in the methods below.

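The reviewer's suggestion can be sketched as follows. This is a simplified illustration of run-length encoding applied to a ByteBuffer[] directly, with no intermediate List; it is not the actual rleEncodeOnData logic from the PR. The class `ArrayRleSketch`, the `Encoded` holder and the int[] run-length layout are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ArrayRleSketch {

  /** Hypothetical result holder: each distinct consecutive value and the length of its run. */
  public static final class Encoded {
    public final ByteBuffer[] values;
    public final int[] runLengths;

    Encoded(ByteBuffer[] values, int[] runLengths) {
      this.values = values;
      this.runLengths = runLengths;
    }
  }

  /** Collapses consecutive equal rows, working on the array directly (no List conversion). */
  public static Encoded rleEncode(ByteBuffer[] dataPage) {
    if (dataPage.length == 0) {
      return new Encoded(new ByteBuffer[0], new int[0]);
    }
    ByteBuffer[] values = new ByteBuffer[dataPage.length];
    int[] runs = new int[dataPage.length];
    int last = 0;
    values[0] = dataPage[0];
    runs[0] = 1;
    for (int i = 1; i < dataPage.length; i++) {
      // ByteBuffer.equals compares the remaining bytes, for heap and direct buffers alike
      if (dataPage[i].equals(values[last])) {
        runs[last]++;
      } else {
        last++;
        values[last] = dataPage[i];
        runs[last] = 1;
      }
    }
    // Trim both arrays to the number of runs actually found
    return new Encoded(Arrays.copyOf(values, last + 1), Arrays.copyOf(runs, last + 1));
  }

  public static void main(String[] args) {
    ByteBuffer[] page = {
        ByteBuffer.wrap(new byte[] {1}), ByteBuffer.wrap(new byte[] {1}),
        ByteBuffer.wrap(new byte[] {2}), ByteBuffer.wrap(new byte[] {3}),
        ByteBuffer.wrap(new byte[] {3}), ByteBuffer.wrap(new byte[] {3})
    };
    Encoded encoded = rleEncode(page);
    System.out.println("runs=" + Arrays.toString(encoded.runLengths));
  }
}
```

Working on the array also makes a separate convertToDataPage step unnecessary in this sketch, since the result is already a ByteBuffer[].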
ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376869338

##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/columnar/BlockIndexerStorageForNoInvertedIndexForShort.java
##########
@@ -79,12 +82,8 @@ private void rleEncodeOnData(List<byte[]> actualDataList) {
     }
   }
 
-  private byte[][] convertToDataPage(List<byte[]> list) {
-    byte[][] shortArray = new byte[list.size()][];
-    for (int i = 0; i < shortArray.length; i++) {
-      shortArray[i] = list.get(i);
-    }
-    return shortArray;
+  private ByteBuffer[] convertToDataPage(List<ByteBuffer> list) {

Review comment:
We should avoid this redundant conversion: use ByteBuffer[] directly everywhere and don't convert to a list.

ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376871815

##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java
##########
@@ -747,6 +759,16 @@ public long getPageLengthInBytes() throws IOException {
    */
   public byte[] compress(Compressor compressor) throws IOException {
     DataType dataType = columnPageEncoderMeta.getStoreDataType();
+
+    // if the columnpage is isUnsafeEnabled and the Datatype is primitive.
+    // we try to compress the data in offheap directly, avoiding a copy from offheap to heap
+    if (isUnsafeEnabled() && (dataType == DataTypes.BOOLEAN || dataType == BYTE
+        || dataType == SHORT || dataType == DataTypes.SHORT_INT || dataType == INT
+        || dataType == LONG || dataType == FLOAT || dataType == DOUBLE
+        || DataTypes.isDecimal(dataType))) {

Review comment:
Is Decimal supported? Below I see that getByteBufferArrayPage is unsupported in DecimalColumnPage.

ajantha-bhat commented on a change in pull request #3607: [CARBONDATA-3670] Support compress offheap data in columnpage directly, avoiding a copy of data from offheap to heap before compressed.
URL: https://github.com/apache/carbondata/pull/3607#discussion_r376872535

##########
File path: core/src/main/java/org/apache/carbondata/core/datastore/page/SafeFixLengthColumnPage.java
##########
@@ -289,6 +298,15 @@ public BigDecimal getDecimal(int rowId) {
     return data;
   }
 
+  @Override
+  public ByteBuffer[] getByteBufferArrayPage(boolean isFlattened) {

Review comment:
These changes were only for offheap data, right? So I would expect only the unsafe pages to change. Why were the safe column pages changed as well?

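The thread does not record the author's answer. One possible reading, offered only as an assumption, is that once the codecs consume ByteBuffer[] instead of byte[][], a safe (on-heap) page also needs some way to expose its rows through the same interface. The sketch below shows how that could be done over a heap byte[][]; the class `SafePageViewSketch` and the way `isFlattened` is interpreted (one buffer per row versus a single concatenated buffer) are hypothetical and are not taken from the PR's code.

```java
import java.nio.ByteBuffer;

public class SafePageViewSketch {

  /**
   * Hypothetical on-heap counterpart of getByteBufferArrayPage(boolean isFlattened).
   * Not flattened: one ByteBuffer per row, each wrapping the row's existing array (no copy).
   * Flattened: a single heap ByteBuffer holding all rows back to back (this does copy).
   */
  public static ByteBuffer[] getByteBufferArrayPage(byte[][] rows, boolean isFlattened) {
    if (!isFlattened) {
      ByteBuffer[] views = new ByteBuffer[rows.length];
      for (int i = 0; i < rows.length; i++) {
        // wrap() shares the backing array, so no row data is duplicated
        views[i] = ByteBuffer.wrap(rows[i]);
      }
      return views;
    }
    int total = 0;
    for (byte[] row : rows) {
      total += row.length;
    }
    ByteBuffer flat = ByteBuffer.allocate(total);
    for (byte[] row : rows) {
      flat.put(row);
    }
    flat.flip();
    return new ByteBuffer[] {flat};
  }

  public static void main(String[] args) {
    byte[][] rows = {{1, 2}, {3}, {4, 5, 6}};
    System.out.println("per-row buffers: " + getByteBufferArrayPage(rows, false).length);
    System.out.println("flattened bytes: " + getByteBufferArrayPage(rows, true)[0].remaining());
  }
}
```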