CarbonDataQA1 commented on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-593843924 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2277/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387449577 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/compression/AbstractCompressor.java ########## @@ -48,10 +49,11 @@ } @Override - public byte[] compressInt(int[] unCompInput) { - ByteBuffer unCompBuffer = ByteBuffer.allocate(unCompInput.length * ByteUtil.SIZEOF_INT); + public ByteBuffer compressInt(int[] unCompInput) { + ByteBuffer unCompBuffer = ByteBuffer.allocateDirect(unCompInput.length * ByteUtil.SIZEOF_INT); Review comment: `DirecByteBuffer` are tricky, they reside outside JVM. we need to clean by reflection. Hence it is not used in our code. (can use unsafe if we still need to offheap memory) DirecByteBuffers are useful only while interacting with native libraries. So, I suggest we can use `ByteBuffer` instead of `DirectByteBuffer` https://stackoverflow.com/questions/5670862/bytebuffer-allocate-vs-bytebuffer-allocatedirect @ravipesala , @kumarvishal09 what's your opinion on this ? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387465206 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/compression/Compressor.java ########## @@ -18,36 +18,39 @@ package org.apache.carbondata.core.datastore.compression; import java.io.IOException; +import java.nio.ByteBuffer; public interface Compressor { String getName(); - byte[] compressByte(byte[] unCompInput); + ByteBuffer compressByte(ByteBuffer compInput); Review comment: Similar problem exists in query flow also ? can we use `ByteBuffer` for `uncompress` return type also ? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387467031 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java ########## @@ -143,10 +144,17 @@ public static ColumnPage newLocalDictPage(ColumnPageEncoderMeta columnPageEncode boolean isDecoderBasedFallBackEnabled = Boolean.parseBoolean(CarbonProperties.getInstance() .getProperty(CarbonCommonConstants.LOCAL_DICTIONARY_DECODER_BASED_FALLBACK, CarbonCommonConstants.LOCAL_DICTIONARY_DECODER_BASED_FALLBACK_DEFAULT)); + DataType dataType = columnPageEncoderMeta.getStoreDataType(); ColumnPage actualPage; ColumnPage encodedPage; if (isUnsafeEnabled(columnPageEncoderMeta)) { - actualPage = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); + if (dataType == DataTypes.STRING || + dataType == DataTypes.VARCHAR || + dataType == DataTypes.BINARY) { + actualPage = new ByteBufferColumnPage(columnPageEncoderMeta, pageSize); + } else { + actualPage = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); Review comment: local dictionary supported for string and varchar. so this may be dead code now. remove it ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387467864 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java ########## @@ -182,20 +190,18 @@ public static ColumnPage newPage(ColumnPageEncoderMeta columnPageEncoderMeta, in dataType == LONG || dataType == DataTypes.FLOAT || dataType == DataTypes.DOUBLE) { - instance = new UnsafeFixLengthColumnPage( - new ColumnPageEncoderMeta(columnSpec, dataType, compressorName), pageSize); + instance = new UnsafeFixLengthColumnPage(columnPageEncoderMeta, pageSize); } else if (dataType == DataTypes.TIMESTAMP) { instance = new UnsafeFixLengthColumnPage( new ColumnPageEncoderMeta(columnSpec, LONG, compressorName), pageSize); } else if (DataTypes.isDecimal(dataType)) { - instance = new UnsafeDecimalColumnPage( - new ColumnPageEncoderMeta(columnSpec, dataType, compressorName), pageSize); - } else if (dataType == DataTypes.STRING - || dataType == BYTE_ARRAY - || dataType == DataTypes.VARCHAR - || dataType == DataTypes.BINARY) { - instance = new UnsafeVarLengthColumnPage( - new ColumnPageEncoderMeta(columnSpec, dataType, compressorName), pageSize); + instance = new UnsafeDecimalColumnPage(columnPageEncoderMeta, pageSize); + } else if (dataType == DataTypes.STRING || + dataType == DataTypes.VARCHAR || + dataType == DataTypes.BINARY) { + instance = new ByteBufferColumnPage(columnPageEncoderMeta, pageSize); + } else if (dataType == BYTE_ARRAY) { + instance = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); Review comment: This is also LV flow, I think we can use `ByteBufferColumnPage` for BYTE_ARRAY also ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387467961 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java ########## @@ -218,10 +224,11 @@ public static ColumnPage newPage(ColumnPageEncoderMeta columnPageEncoderMeta, in instance = newDoublePage(columnPageEncoderMeta, new double[pageSize]); } else if (DataTypes.isDecimal(dataType)) { instance = newDecimalPage(columnPageEncoderMeta, new byte[pageSize][]); - } else if (dataType == DataTypes.STRING - || dataType == BYTE_ARRAY - || dataType == DataTypes.VARCHAR - || dataType == DataTypes.BINARY) { + } else if (dataType == DataTypes.STRING || + dataType == DataTypes.VARCHAR || + dataType == DataTypes.BINARY) { + instance = new ByteBufferColumnPage(columnPageEncoderMeta, pageSize); + } else if (dataType == BYTE_ARRAY) { Review comment: same as above ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387469248 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/compression/AbstractCompressor.java ########## @@ -48,10 +49,11 @@ } @Override - public byte[] compressInt(int[] unCompInput) { - ByteBuffer unCompBuffer = ByteBuffer.allocate(unCompInput.length * ByteUtil.SIZEOF_INT); + public ByteBuffer compressInt(int[] unCompInput) { + ByteBuffer unCompBuffer = ByteBuffer.allocateDirect(unCompInput.length * ByteUtil.SIZEOF_INT); Review comment: DirectByteBuffer will be GC collected if there are no object reference to it. In our compression case, this object is temporary. So I think it is ok. @ajantha-bhat ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387469426 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/compression/Compressor.java ########## @@ -18,36 +18,39 @@ package org.apache.carbondata.core.datastore.compression; import java.io.IOException; +import java.nio.ByteBuffer; public interface Compressor { String getName(); - byte[] compressByte(byte[] unCompInput); + ByteBuffer compressByte(ByteBuffer compInput); Review comment: Many places need to change for query flow, can not do in this PR ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387470422 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java ########## @@ -182,20 +190,18 @@ public static ColumnPage newPage(ColumnPageEncoderMeta columnPageEncoderMeta, in dataType == LONG || dataType == DataTypes.FLOAT || dataType == DataTypes.DOUBLE) { - instance = new UnsafeFixLengthColumnPage( - new ColumnPageEncoderMeta(columnSpec, dataType, compressorName), pageSize); + instance = new UnsafeFixLengthColumnPage(columnPageEncoderMeta, pageSize); } else if (dataType == DataTypes.TIMESTAMP) { instance = new UnsafeFixLengthColumnPage( new ColumnPageEncoderMeta(columnSpec, LONG, compressorName), pageSize); } else if (DataTypes.isDecimal(dataType)) { - instance = new UnsafeDecimalColumnPage( - new ColumnPageEncoderMeta(columnSpec, dataType, compressorName), pageSize); - } else if (dataType == DataTypes.STRING - || dataType == BYTE_ARRAY - || dataType == DataTypes.VARCHAR - || dataType == DataTypes.BINARY) { - instance = new UnsafeVarLengthColumnPage( - new ColumnPageEncoderMeta(columnSpec, dataType, compressorName), pageSize); + instance = new UnsafeDecimalColumnPage(columnPageEncoderMeta, pageSize); + } else if (dataType == DataTypes.STRING || + dataType == DataTypes.VARCHAR || + dataType == DataTypes.BINARY) { + instance = new ByteBufferColumnPage(columnPageEncoderMeta, pageSize); + } else if (dataType == BYTE_ARRAY) { + instance = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); Review comment: It is not always for LV flow, for example, for local dictionary it uses UnsafeFixLengthColumnPage with store data type as BYTE_ARRAY ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387470463 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java ########## @@ -143,10 +144,17 @@ public static ColumnPage newLocalDictPage(ColumnPageEncoderMeta columnPageEncode boolean isDecoderBasedFallBackEnabled = Boolean.parseBoolean(CarbonProperties.getInstance() .getProperty(CarbonCommonConstants.LOCAL_DICTIONARY_DECODER_BASED_FALLBACK, CarbonCommonConstants.LOCAL_DICTIONARY_DECODER_BASED_FALLBACK_DEFAULT)); + DataType dataType = columnPageEncoderMeta.getStoreDataType(); ColumnPage actualPage; ColumnPage encodedPage; if (isUnsafeEnabled(columnPageEncoderMeta)) { - actualPage = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); + if (dataType == DataTypes.STRING || + dataType == DataTypes.VARCHAR || + dataType == DataTypes.BINARY) { + actualPage = new ByteBufferColumnPage(columnPageEncoderMeta, pageSize); + } else { + actualPage = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); Review comment: removed ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387470537 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java ########## @@ -218,10 +224,11 @@ public static ColumnPage newPage(ColumnPageEncoderMeta columnPageEncoderMeta, in instance = newDoublePage(columnPageEncoderMeta, new double[pageSize]); } else if (DataTypes.isDecimal(dataType)) { instance = newDecimalPage(columnPageEncoderMeta, new byte[pageSize][]); - } else if (dataType == DataTypes.STRING - || dataType == BYTE_ARRAY - || dataType == DataTypes.VARCHAR - || dataType == DataTypes.BINARY) { + } else if (dataType == DataTypes.STRING || + dataType == DataTypes.VARCHAR || + dataType == DataTypes.BINARY) { + instance = new ByteBufferColumnPage(columnPageEncoderMeta, pageSize); + } else if (dataType == BYTE_ARRAY) { Review comment: It is not always for LV flow, for example, for local dictionary it uses UnsafeFixLengthColumnPage with store data type as BYTE_ARRAY ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
ajantha-bhat commented on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594350791 I have gone through on a very high level. I need to check out code and review agian. @jackylk : It will be useful if you can give one flow of all the 5 places where new data was created. I couldn't find 5 places. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594351151 @ajantha-bhat ``` CarbonRowDataWriterProcessorStepImpl.processBatch: CarbonRow converted = convertRowWithoutRearrange(row); TablePage.convertToColumnarAndAddToPages: byte[] valueWithLength = addShortLengthToByteArray((byte[]) noDictAndComplex[i]); // alloc and copy noDictDimensionPages[i].putData(rowId, valueWithLength); // safe copy to unsafe HighCardDictDimensionIndexCodec.encodeIndexStorage: byte[][] data = input.getByteArrayPage(); // unsafe copy to safe byte[] flattened = ByteUtil.flatten(indexStorage.getDataPage()); // alloc and copy SnappyCompressor.compressByte: Snappy.rawCompress(unCompInput, unCompInput.length); // copy inside, shoud use rawCompress(Object input, int inputOffset, int inputLength, byte[] output, int outputOffset) ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk edited a comment on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594351151 @ajantha-bhat In original code, 5 copies are: ``` TablePage.convertToColumnarAndAddToPages: byte[] valueWithLength = addShortLengthToByteArray((byte[]) noDictAndComplex[i]); // alloc and copy noDictDimensionPages[i].putData(rowId, valueWithLength); // safe copy to unsafe HighCardDictDimensionIndexCodec.encodeIndexStorage: byte[][] data = input.getByteArrayPage(); // copy unsafe to safe byte[] flattened = ByteUtil.flatten(indexStorage.getDataPage()); // alloc and copy SnappyCompressor.compressByte: Snappy.rawCompress(unCompInput, unCompInput.length); // one copy and one compression write inside, shoud use direct buffer which has only one compression write ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk edited a comment on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594351151 @ajantha-bhat In original code, 5 copies are: ``` TablePage.convertToColumnarAndAddToPages: byte[] valueWithLength = addShortLengthToByteArray((byte[]) noDictAndComplex[i]); // alloc and copy noDictDimensionPages[i].putData(rowId, valueWithLength); // safe copy to unsafe HighCardDictDimensionIndexCodec.encodeIndexStorage: byte[][] data = input.getByteArrayPage(); // copy unsafe to safe byte[] flattened = ByteUtil.flatten(indexStorage.getDataPage()); // alloc and copy SnappyCompressor.compressByte: Snappy.rawCompress(unCompInput, unCompInput.length); // one copy and one compression write inside, should use direct buffer version of the API which has only one compression write ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk edited a comment on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594351151 @ajantha-bhat In original code, 5 copies are: ``` TablePage.convertToColumnarAndAddToPages: byte[] valueWithLength = addShortLengthToByteArray((byte[]) noDictAndComplex[i]); // alloc and copy noDictDimensionPages[i].putData(rowId, valueWithLength); // copy safe to unsafe HighCardDictDimensionIndexCodec.encodeIndexStorage: byte[][] data = input.getByteArrayPage(); // copy unsafe to safe byte[] flattened = ByteUtil.flatten(indexStorage.getDataPage()); // alloc and copy SnappyCompressor.compressByte: Snappy.rawCompress(unCompInput, unCompInput.length); // one copy and one compression write inside, should use direct buffer version of the API which has only one compression write ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594363382 Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/2304/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594364130 Build Failed with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/597/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
jackylk commented on a change in pull request #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#discussion_r387470463 ########## File path: core/src/main/java/org/apache/carbondata/core/datastore/page/ColumnPage.java ########## @@ -143,10 +144,17 @@ public static ColumnPage newLocalDictPage(ColumnPageEncoderMeta columnPageEncode boolean isDecoderBasedFallBackEnabled = Boolean.parseBoolean(CarbonProperties.getInstance() .getProperty(CarbonCommonConstants.LOCAL_DICTIONARY_DECODER_BASED_FALLBACK, CarbonCommonConstants.LOCAL_DICTIONARY_DECODER_BASED_FALLBACK_DEFAULT)); + DataType dataType = columnPageEncoderMeta.getStoreDataType(); ColumnPage actualPage; ColumnPage encodedPage; if (isUnsafeEnabled(columnPageEncoderMeta)) { - actualPage = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); + if (dataType == DataTypes.STRING || + dataType == DataTypes.VARCHAR || + dataType == DataTypes.BINARY) { + actualPage = new ByteBufferColumnPage(columnPageEncoderMeta, pageSize); + } else { + actualPage = new UnsafeVarLengthColumnPage(columnPageEncoderMeta, pageSize); Review comment: Not really, complex column needs this ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3638: [CARBONDATA-3731] Avoid data copy in Writer
URL: https://github.com/apache/carbondata/pull/3638#issuecomment-594415233 Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/601/ ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [hidden email] With regards, Apache Git Services |
Free forum by Nabble | Edit this page |