GitHub user ravipesala opened a pull request:
https://github.com/apache/carbondata/pull/2820

[CARBONDATA-3013] Added support for pruning pages for vector direct fill.

This PR depends on PR https://github.com/apache/carbondata/pull/2819.

First, page-level pruning is applied using the min/max of each page to get the valid pages of the blocklet. Only the valid pages are then decompressed and filled into the vector directly, as in the full scan query scenario. To prune pages before decompressing the data, a new method is added to `FilterExecuter`:

```
BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks)
    throws FilterUnsupportedException, IOException;
```

This method reads the necessary column chunk metadata and prunes the pages according to the min/max metadata. Based on the pruned pages, `BlockletScannedResult` decompresses and fills the column page data into the vector, as described for full scan in the above-mentioned PR.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done
      Please provide details on
      - Whether new unit test cases have been added or why no new tests are required?
      - How it is tested? Please attach test report.
      - Is it a performance related change? Please attach the performance test report.
      - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ravipesala/incubator-carbondata perf-filter-scan1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2820.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2820

----

commit f41734b9d123d1ad5f9e7955e594b98899fce208
Author: ravipesala <ravi.pesala@...>
Date: 2018-10-16T05:02:18Z

    Add carbon property to configure vector based row pruning push down

commit 4ed51eb0415dd5c92d804bcf9a3d1e6421f56556
Author: ravipesala <ravi.pesala@...>
Date: 2018-10-16T06:00:43Z

    Added support for full scan queries for vector direct fill.

commit dbd86a6b103506acf0b7f9783a88c88d8926ed77
Author: ravipesala <ravi.pesala@...>
Date: 2018-10-16T09:23:14Z

    Added support for pruning pages for vector direct fill.

----

---
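For illustration, here is a minimal, self-contained Java sketch of the min/max page-pruning idea described above. The `PageMeta` type and `filterValue` are illustrative stand-ins, not CarbonData classes; the actual `prunePages` implementations appear in the review diffs below.

```
import java.util.BitSet;

// Minimal sketch: keep only pages whose [min, max] range can contain the filter value.
public class PagePruneSketch {
  static class PageMeta {            // illustrative stand-in for per-page min/max metadata
    final long min;
    final long max;
    PageMeta(long min, long max) { this.min = min; this.max = max; }
  }

  // Returns a BitSet with one bit per page; a set bit means "decompress and scan this page".
  static BitSet prunePages(PageMeta[] pages, long filterValue) {
    BitSet valid = new BitSet(pages.length);
    for (int i = 0; i < pages.length; i++) {
      if (filterValue >= pages[i].min && filterValue <= pages[i].max) {
        valid.set(i);                // page may contain matching rows
      }
    }
    return valid;
  }

  public static void main(String[] args) {
    PageMeta[] pages = { new PageMeta(0, 9), new PageMeta(10, 19), new PageMeta(20, 29) };
    System.out.println(prunePages(pages, 15));   // prints {1}: only the second page survives
  }
}
```

Only the surviving pages are then decompressed and filled into the vector, which is where the savings over whole-blocklet decompression come from.

---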
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2820

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/809/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2820

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1006/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2820

Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9074/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2820

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/815/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2820

Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9080/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2820

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1012/

---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226326803

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/DimensionRawColumnChunk.java ---

@@ -121,6 +122,22 @@ public DimensionColumnPage convertToDimColDataChunkWithOutCache(int index) {
     }
   }

+  /**
+   * Convert raw data with specified page number processed to DimensionColumnDataChunk and fill
+   * the vector
+   *
+   * @param pageNumber page number to decode and fill the vector
+   * @param vectorInfo vector to be filled with column page
+   */
+  public void convertToDimColDataChunkAndFillVector(int pageNumber, ColumnVectorInfo vectorInfo) {
+    assert pageNumber < pagesCount;
+    try {
+      chunkReader.decodeColumnPageAndFillVector(this, pageNumber, vectorInfo);
+    } catch (Exception e) {
+      throw new RuntimeException(e);

--- End diff --

This no longer throws the underlying exception; it is wrapped in a RuntimeException instead.

---
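A standalone sketch of the pattern being flagged here, using an illustrative stand-in `decode()` method rather than the actual CarbonData chunk-reader signature: wrapping a checked exception in `RuntimeException` hides the real failure type from callers, whereas declaring it keeps it catchable.

```
import java.io.IOException;

// Sketch of the review point: fillWrapped() loses the checked type;
// fillDeclared() lets callers handle IOException directly.
public class ExceptionPropagation {
  static void decode() throws IOException {        // stand-in for the chunk reader
    throw new IOException("page decode failed");
  }

  // Pattern flagged in review: the checked IOException is hidden in an unchecked wrapper.
  static void fillWrapped() {
    try {
      decode();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // Alternative: declare and propagate the underlying exception type.
  static void fillDeclared() throws IOException {
    decode();
  }
}
```

---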
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226327006

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/MeasureRawColumnChunk.java ---

@@ -94,7 +95,7 @@ public ColumnPage decodeColumnPage(int pageNumber) {
   public ColumnPage convertToColumnPageWithOutCache(int index) {
     assert index < pagesCount;
     // in case of filter query filter columns blocklet pages will uncompressed
-    // so no need to decode again
+    // so no need to decodeAndFillVector again

--- End diff --

It seems there is no need to modify this comment.

---
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226342877

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/ExcludeFilterExecuterImpl.java ---

@@ -143,6 +144,40 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return null;
   }

+  @Override
+  public BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks)
+      throws FilterUnsupportedException, IOException {
+    if (isDimensionPresentInCurrentBlock) {
+      int chunkIndex = segmentProperties.getDimensionOrdinalToChunkMapping()
+          .get(dimColEvaluatorInfo.getColumnIndex());
+      if (null == rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex]) {

--- End diff --

For the exclude filter case there is no need to read the blocklet column data, as we always return true.

---
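The suggestion could look roughly like the sketch below, built only from API calls visible elsewhere in these diffs (e.g. `getDataBlock().numberOfPages()`); it is not the committed code, and the author's reply below notes the chunk read was only there to obtain the page count.

```
// Sketch: an exclude filter keeps every page, so the page count can come from
// the data block itself and the column chunk need not be read at all.
@Override
public BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks) {
  int numberOfPages = rawBlockletColumnChunks.getDataBlock().numberOfPages();
  BitSet bitSet = new BitSet(numberOfPages);
  bitSet.set(0, numberOfPages);   // mark all pages as valid
  return bitSet;
}
```

---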
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226344587

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java ---

@@ -179,6 +167,75 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return null;
   }

+  private boolean isScanRequired(DimensionRawColumnChunk dimensionRawColumnChunk, int i) {

--- End diff --

Please change `i` to `columnIndex`.

---
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226390252

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RangeValueFilterExecuterImpl.java ---

@@ -146,6 +146,44 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return applyNoAndDirectFilter(rawBlockletColumnChunks, useBitsetPipeLine);
   }

+  @Override
+  public BitSet prunePages(RawBlockletColumnChunks blockChunkHolder)
+      throws FilterUnsupportedException, IOException {
+    // In case of Alter Table Add and Delete Columns the isDimensionPresentInCurrentBlock can be
+    // false, in that scenario the default values of the column should be shown.
+    // select all rows if dimension does not exists in the current block
+    if (!isDimensionPresentInCurrentBlock) {
+      int i = blockChunkHolder.getDataBlock().numberOfPages();

--- End diff --

Please change `i` to `numberOfPages`.

---
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226392086

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java ---

@@ -179,6 +167,75 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return null;
   }

+  private boolean isScanRequired(DimensionRawColumnChunk dimensionRawColumnChunk, int i) {
+    boolean scanRequired;
+    // for no dictionary measure column comparison can be done
+    // on the original data as like measure column
+    if (DataTypeUtil.isPrimitiveColumn(dimColumnEvaluatorInfo.getDimension().getDataType())
+        && !dimColumnEvaluatorInfo.getDimension().hasEncoding(Encoding.DICTIONARY)) {
+      scanRequired = isScanRequired(dimensionRawColumnChunk.getMaxValues()[i],
+          dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys(),
+          dimColumnEvaluatorInfo.getDimension().getDataType());
+    } else {
+      scanRequired = isScanRequired(dimensionRawColumnChunk.getMaxValues()[i],
+          dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys(),
+          dimensionRawColumnChunk.getMinMaxFlagArray()[i]);
+    }
+    return scanRequired;
+  }
+
+  @Override
+  public BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks)
+      throws FilterUnsupportedException, IOException {
+    if (isDimensionPresentInCurrentBlock) {
+      int chunkIndex = segmentProperties.getDimensionOrdinalToChunkMapping()
+          .get(dimColumnEvaluatorInfo.getColumnIndex());
+      if (null == rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex]) {
+        rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex] =
+            rawBlockletColumnChunks.getDataBlock()
+                .readDimensionChunk(rawBlockletColumnChunks.getFileReader(), chunkIndex);
+      }
+      DimensionRawColumnChunk dimensionRawColumnChunk =
+          rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex];
+      filterValues = dimColumnExecuterInfo.getFilterKeys();
+      BitSet bitSet = new BitSet(dimensionRawColumnChunk.getPagesCount());
+      for (int i = 0; i < dimensionRawColumnChunk.getPagesCount(); i++) {
+        if (dimensionRawColumnChunk.getMaxValues() != null) {
+          if (isScanRequired(dimensionRawColumnChunk, i)) {
+            bitSet.set(i);
+          }
+        } else {
+          bitSet.set(i);
+        }
+      }
+      return bitSet;
+    } else if (isMeasurePresentInCurrentBlock) {
+      int chunkIndex = segmentProperties.getMeasuresOrdinalToChunkMapping()
+          .get(msrColumnEvaluatorInfo.getColumnIndex());
+      if (null == rawBlockletColumnChunks.getMeasureRawColumnChunks()[chunkIndex]) {
+        rawBlockletColumnChunks.getMeasureRawColumnChunks()[chunkIndex] =
+            rawBlockletColumnChunks.getDataBlock()
+                .readMeasureChunk(rawBlockletColumnChunks.getFileReader(), chunkIndex);
+      }
+      MeasureRawColumnChunk measureRawColumnChunk =
+          rawBlockletColumnChunks.getMeasureRawColumnChunks()[chunkIndex];
+      BitSet bitSet = new BitSet(measureRawColumnChunk.getPagesCount());
+      for (int i = 0; i < measureRawColumnChunk.getPagesCount(); i++) {
+        if (measureRawColumnChunk.getMaxValues() != null) {
+          if (isScanRequired(measureRawColumnChunk.getMaxValues()[i],
+              measureRawColumnChunk.getMinValues()[i], msrColumnExecutorInfo.getFilterKeys(),
+              msrColumnEvaluatorInfo.getType())) {
+            bitSet.set(i);
+          }
+        } else {
+          bitSet.set(i);
+        }
+      }
+      return bitSet;
+    }
+    return null;

--- End diff --

Is it OK to return null for a dimension/measure column which is not present in the current block?

---
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226394853

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/scanner/impl/BlockletFilterScanner.java ---

@@ -316,4 +320,167 @@ private BlockletScannedResult executeFilter(RawBlockletColumnChunks rawBlockletC
         readTime.getCount() + dimensionReadTime);
     return scannedResult;
   }
+
+  /**
+   * This method will process the data in below order
+   * 1. first apply min max on the filter tree and check whether any of the filter
+   * is fall on the range of min max, if not then return empty result
+   * 2. If filter falls on min max range then apply filter on actual
+   * data and get the pruned pages.
+   * 3. if pruned pages are not empty then read only those blocks(measure or dimension)
+   * which was present in the query but not present in the filter, as while applying filter
+   * some of the blocks where already read and present in chunk holder so not need to
+   * read those blocks again, this is to avoid reading of same blocks which was already read
+   * 4. Set the blocks and filter pages to scanned result
+   *
+   * @param rawBlockletColumnChunks blocklet raw chunk of all columns
+   * @throws FilterUnsupportedException
+   */
+  private BlockletScannedResult executeFilterForPages(
+      RawBlockletColumnChunks rawBlockletColumnChunks)
+      throws FilterUnsupportedException, IOException {
+    long startTime = System.currentTimeMillis();
+    QueryStatistic totalBlockletStatistic = queryStatisticsModel.getStatisticsTypeAndObjMap()
+        .get(QueryStatisticsConstants.TOTAL_BLOCKLET_NUM);
+    totalBlockletStatistic.addCountStatistic(QueryStatisticsConstants.TOTAL_BLOCKLET_NUM,
+        totalBlockletStatistic.getCount() + 1);
+    // apply filter on actual data, for each page
+    BitSet pages = this.filterExecuter.prunePages(rawBlockletColumnChunks);
+    // if filter result is empty then return with empty result
+    if (pages.isEmpty()) {
+      CarbonUtil.freeMemory(rawBlockletColumnChunks.getDimensionRawColumnChunks(),
+          rawBlockletColumnChunks.getMeasureRawColumnChunks());
+
+      QueryStatistic scanTime = queryStatisticsModel.getStatisticsTypeAndObjMap()
+          .get(QueryStatisticsConstants.SCAN_BLOCKlET_TIME);
+      scanTime.addCountStatistic(QueryStatisticsConstants.SCAN_BLOCKlET_TIME,
+          scanTime.getCount() + (System.currentTimeMillis() - startTime));
+
+      QueryStatistic scannedPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
+          .get(QueryStatisticsConstants.PAGE_SCANNED);
+      scannedPages.addCountStatistic(QueryStatisticsConstants.PAGE_SCANNED,
+          scannedPages.getCount());
+      return createEmptyResult();
+    }
+
+    BlockletScannedResult scannedResult =
+        new FilterQueryScannedResult(blockExecutionInfo, queryStatisticsModel);
+
+    // valid scanned blocklet
+    QueryStatistic validScannedBlockletStatistic = queryStatisticsModel.getStatisticsTypeAndObjMap()
+        .get(QueryStatisticsConstants.VALID_SCAN_BLOCKLET_NUM);
+    validScannedBlockletStatistic
+        .addCountStatistic(QueryStatisticsConstants.VALID_SCAN_BLOCKLET_NUM,
+            validScannedBlockletStatistic.getCount() + 1);
+    // adding statistics for valid number of pages
+    QueryStatistic validPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
+        .get(QueryStatisticsConstants.VALID_PAGE_SCANNED);
+    validPages.addCountStatistic(QueryStatisticsConstants.VALID_PAGE_SCANNED,
+        validPages.getCount() + pages.cardinality());
+    QueryStatistic scannedPages = queryStatisticsModel.getStatisticsTypeAndObjMap()
+        .get(QueryStatisticsConstants.PAGE_SCANNED);
+    scannedPages.addCountStatistic(QueryStatisticsConstants.PAGE_SCANNED,
+        scannedPages.getCount() + pages.cardinality());
+    // get the row indexes from bit set for each page
+    int[] pageFilteredPages = new int[pages.cardinality()];
+    int index = 0;
+    for (int i = pages.nextSetBit(0); i >= 0; i = pages.nextSetBit(i + 1)) {
+      pageFilteredPages[index++] = i;
+    }
+    // count(*) case there would not be any dimensions are measures selected.
+    int[] numberOfRows = new int[pages.cardinality()];
+    for (int i = 0; i < numberOfRows.length; i++) {
+      numberOfRows[i] = rawBlockletColumnChunks.getDataBlock().getPageRowCount(i);
+    }
+    long dimensionReadTime = System.currentTimeMillis();
+    dimensionReadTime = System.currentTimeMillis() - dimensionReadTime;
+

--- End diff --

Please remove the empty lines.

---
Github user kumarvishal09 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226395410

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeGrtThanFiterExecuterImpl.java ---

@@ -148,6 +148,61 @@ private void ifDefaultValueMatchesFilter() {
     return bitSet;
   }

+  @Override
+  public BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks)
+      throws FilterUnsupportedException, IOException {

--- End diff --

For all the RowLevelRangeFilters, can we move some part of the code to their super class to remove code duplication?

---
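One way to realize this suggestion is a template method: the shared superclass owns the `prunePages` loop and each RowLevelRange* subclass supplies only its min/max comparison. A simplified, self-contained sketch (class names and the `long[]` page metadata are illustrative, not CarbonData types):

```
import java.util.BitSet;

// Simplified template-method sketch: the common page loop lives in the base
// class; each range filter overrides only the min/max comparison.
abstract class RangeFilterBase {
  BitSet prunePages(long[] pageMin, long[] pageMax) {
    BitSet pages = new BitSet(pageMin.length);
    for (int i = 0; i < pageMin.length; i++) {
      if (isScanRequired(pageMin[i], pageMax[i])) {
        pages.set(i);
      }
    }
    return pages;
  }

  // the only part that differs between greater-than, less-than, etc.
  abstract boolean isScanRequired(long min, long max);
}

class GreaterThanFilter extends RangeFilterBase {
  private final long filterValue;
  GreaterThanFilter(long filterValue) { this.filterValue = filterValue; }

  @Override
  boolean isScanRequired(long min, long max) {
    // a page can contain rows > filterValue only if its max exceeds it
    return max > filterValue;
  }
}
```

---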
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226863702

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/chunk/impl/MeasureRawColumnChunk.java ---

@@ -94,7 +95,7 @@ public ColumnPage decodeColumnPage(int pageNumber) {
   public ColumnPage convertToColumnPageWithOutCache(int index) {
     assert index < pagesCount;
     // in case of filter query filter columns blocklet pages will uncompressed
-    // so no need to decode again
+    // so no need to decodeAndFillVector again

--- End diff --

OK.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226863773

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/ExcludeFilterExecuterImpl.java ---

@@ -143,6 +144,40 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return null;
   }

+  @Override
+  public BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks)
+      throws FilterUnsupportedException, IOException {
+    if (isDimensionPresentInCurrentBlock) {
+      int chunkIndex = segmentProperties.getDimensionOrdinalToChunkMapping()
+          .get(dimColEvaluatorInfo.getColumnIndex());
+      if (null == rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex]) {

--- End diff --

It is read to get the page count.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226863869

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java ---

@@ -179,6 +167,75 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return null;
   }

+  private boolean isScanRequired(DimensionRawColumnChunk dimensionRawColumnChunk, int i) {
+    boolean scanRequired;
+    // for no dictionary measure column comparison can be done
+    // on the original data as like measure column
+    if (DataTypeUtil.isPrimitiveColumn(dimColumnEvaluatorInfo.getDimension().getDataType())
+        && !dimColumnEvaluatorInfo.getDimension().hasEncoding(Encoding.DICTIONARY)) {
+      scanRequired = isScanRequired(dimensionRawColumnChunk.getMaxValues()[i],
+          dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys(),
+          dimColumnEvaluatorInfo.getDimension().getDataType());
+    } else {
+      scanRequired = isScanRequired(dimensionRawColumnChunk.getMaxValues()[i],
+          dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys(),
+          dimensionRawColumnChunk.getMinMaxFlagArray()[i]);
+    }
+    return scanRequired;
+  }
+
+  @Override
+  public BitSet prunePages(RawBlockletColumnChunks rawBlockletColumnChunks)
+      throws FilterUnsupportedException, IOException {
+    if (isDimensionPresentInCurrentBlock) {
+      int chunkIndex = segmentProperties.getDimensionOrdinalToChunkMapping()
+          .get(dimColumnEvaluatorInfo.getColumnIndex());
+      if (null == rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex]) {
+        rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex] =
+            rawBlockletColumnChunks.getDataBlock()
+                .readDimensionChunk(rawBlockletColumnChunks.getFileReader(), chunkIndex);
+      }
+      DimensionRawColumnChunk dimensionRawColumnChunk =
+          rawBlockletColumnChunks.getDimensionRawColumnChunks()[chunkIndex];
+      filterValues = dimColumnExecuterInfo.getFilterKeys();
+      BitSet bitSet = new BitSet(dimensionRawColumnChunk.getPagesCount());
+      for (int i = 0; i < dimensionRawColumnChunk.getPagesCount(); i++) {
+        if (dimensionRawColumnChunk.getMaxValues() != null) {
+          if (isScanRequired(dimensionRawColumnChunk, i)) {
+            bitSet.set(i);
+          }
+        } else {
+          bitSet.set(i);
+        }
+      }
+      return bitSet;
+    } else if (isMeasurePresentInCurrentBlock) {
+      int chunkIndex = segmentProperties.getMeasuresOrdinalToChunkMapping()
+          .get(msrColumnEvaluatorInfo.getColumnIndex());
+      if (null == rawBlockletColumnChunks.getMeasureRawColumnChunks()[chunkIndex]) {
+        rawBlockletColumnChunks.getMeasureRawColumnChunks()[chunkIndex] =
+            rawBlockletColumnChunks.getDataBlock()
+                .readMeasureChunk(rawBlockletColumnChunks.getFileReader(), chunkIndex);
+      }
+      MeasureRawColumnChunk measureRawColumnChunk =
+          rawBlockletColumnChunks.getMeasureRawColumnChunks()[chunkIndex];
+      BitSet bitSet = new BitSet(measureRawColumnChunk.getPagesCount());
+      for (int i = 0; i < measureRawColumnChunk.getPagesCount(); i++) {
+        if (measureRawColumnChunk.getMaxValues() != null) {
+          if (isScanRequired(measureRawColumnChunk.getMaxValues()[i],
+              measureRawColumnChunk.getMinValues()[i], msrColumnExecutorInfo.getFilterKeys(),
+              msrColumnEvaluatorInfo.getType())) {
+            bitSet.set(i);
+          }
+        } else {
+          bitSet.set(i);
+        }
+      }
+      return bitSet;
+    }
+    return null;

--- End diff --

This case is not supposed to happen; applyFilter also returns null here.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226863873

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RangeValueFilterExecuterImpl.java ---

@@ -146,6 +146,44 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return applyNoAndDirectFilter(rawBlockletColumnChunks, useBitsetPipeLine);
   }

+  @Override
+  public BitSet prunePages(RawBlockletColumnChunks blockChunkHolder)
+      throws FilterUnsupportedException, IOException {
+    // In case of Alter Table Add and Delete Columns the isDimensionPresentInCurrentBlock can be
+    // false, in that scenario the default values of the column should be shown.
+    // select all rows if dimension does not exists in the current block
+    if (!isDimensionPresentInCurrentBlock) {
+      int i = blockChunkHolder.getDataBlock().numberOfPages();

--- End diff --

OK.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2820#discussion_r226863877

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java ---

@@ -179,6 +167,75 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     return null;
   }

+  private boolean isScanRequired(DimensionRawColumnChunk dimensionRawColumnChunk, int i) {

--- End diff --

OK.

---