GitHub user manishgupta88 opened a pull request:
https://github.com/apache/carbondata/pull/2725

[WIP] Added code to support storing min max for string columns based on number of characters

- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done
      Please provide details on
      - Whether new unit test cases have been added or why no new tests are required?
      - How it is tested? Please attach test report.
      - Is it a performance related change? Please attach the performance test report.
      - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/carbondata string_min_max_decision

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2725.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2725

----
commit 6ef6d880e68aa3bfc1c704659c916b384e5d545a
Author: manishgupta88 <tomanishgupta18@...>
Date: 2018-09-14T05:13:20Z

    Modified code to support writing the min max flag for all the columns in the metadata.
    This will help in deciding whether min max for a column is written or not.

commit 4ed24f3972cb4c03133912ba8a1c9d35b3122e7a
Author: manishgupta88 <tomanishgupta18@...>
Date: 2018-09-15T06:20:03Z

    Modified code to support filter query using min max flag for a column.

----
---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/296/ ---
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8541/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/471/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/298/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8543/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/473/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/302/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8547/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/477/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/311/ ---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218048262

--- Diff: core/src/main/java/org/apache/carbondata/core/datastore/DataRefNode.java ---

@@ -125,4 +125,12 @@ DimensionRawColumnChunk readDimensionChunk(FileReader fileReader, int columnInde
    * @return
    */
   BitSetGroup getIndexedData();
+
+  /**
+   * Return the array which contains the flag for each column whether the min max for that column
+   * is written in metadata or not
+   *
+   * @return
+   */
+  boolean[] isMinMaxSet();

--- End diff --

Better change the name to `minMaxFlagArray`

---
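The interface change under review adds a per-column flag array so a reader can tell whether min/max statistics were actually written for a column before relying on them for pruning. The following is a minimal, hypothetical sketch of that idea, using the reviewer's suggested name `minMaxFlagArray`; the `DataBlock`, `SimpleDataBlock`, and `MinMaxFlagDemo` names are illustrative, not the actual CarbonData classes:

```java
// Hypothetical sketch, not the actual CarbonData interfaces: a data block
// exposes one boolean per column recording whether min/max statistics
// were written for that column, and a caller falls back to a full scan
// whenever the flag is false.
interface DataBlock {
  // one entry per column: true if min/max stats exist in the metadata
  boolean[] minMaxFlagArray();
}

class SimpleDataBlock implements DataBlock {
  private final boolean[] flags;

  SimpleDataBlock(boolean[] flags) {
    this.flags = flags;
  }

  @Override public boolean[] minMaxFlagArray() {
    return flags;
  }
}

public class MinMaxFlagDemo {
  // when min/max is absent for the column, pruning is impossible and the
  // block must be scanned; otherwise defer to the real min/max comparison
  static boolean scanRequired(DataBlock block, int columnIndex) {
    return !block.minMaxFlagArray()[columnIndex] || pruneSaysScan(columnIndex);
  }

  // placeholder for the real min/max-vs-filter comparison
  static boolean pruneSaysScan(int columnIndex) {
    return false;
  }

  public static void main(String[] args) {
    DataBlock block = new SimpleDataBlock(new boolean[] { true, false });
    System.out.println(scanRequired(block, 0)); // min/max present, pruned
    System.out.println(scanRequired(block, 1)); // no min/max, must scan
  }
}
```

The key invariant is that a missing flag must degrade to "scan required" rather than "prune", so correctness never depends on absent statistics.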
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218059894

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataRefNode.java ---

@@ -133,6 +134,21 @@
     return null;
   }

+  @Override
+  public boolean[] isMinMaxSet() {
+    BlockletIndex blockletIndex =
+        blockInfos.get(index).getDetailInfo().getBlockletInfo().getBlockletIndex();

--- End diff --

No need to pass this information from the driver to the executor.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218060271

--- Diff: core/src/main/java/org/apache/carbondata/core/metadata/blocklet/index/BlockletIndex.java ---

@@ -75,4 +80,12 @@ public void setMinMaxIndex(BlockletMinMaxIndex minMaxIndex) {
     this.minMaxIndex = minMaxIndex;
   }

+  @Override public void write(DataOutput out) throws IOException {

--- End diff --

Please don't serialize this, as it is not required to pass it to the executor.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218061484

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/IncludeFilterExecuterImpl.java ---

@@ -108,10 +108,13 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     BitSetGroup bitSetGroup = new BitSetGroup(dimensionRawColumnChunk.getPagesCount());
     filterValues = dimColumnExecuterInfo.getFilterKeys();
     boolean isDecoded = false;
+    boolean isMinMaxSetForFilterDimension =
+        rawBlockletColumnChunks.getDataBlock().isMinMaxSet()[chunkIndex];
     for (int i = 0; i < dimensionRawColumnChunk.getPagesCount(); i++) {
       if (dimensionRawColumnChunk.getMaxValues() != null) {
-        if (isScanRequired(dimensionRawColumnChunk.getMaxValues()[i],
-            dimensionRawColumnChunk.getMinValues()[i], dimColumnExecuterInfo.getFilterKeys())) {
+        if (!isMinMaxSetForFilterDimension || isScanRequired(

--- End diff --

Please use the min/max flags from the current page and do the scan-required check.

---
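The reviewer's point is that a single blocklet-level flag is coarser than necessary: the flag should be consulted per page, so that pages whose min/max was skipped are always scanned while the remaining pages can still be pruned. A hypothetical sketch of that per-page check; the class name, the `byte[][]` page layout, and the unsigned lexicographic comparison are illustrative assumptions, not the actual CarbonData code:

```java
// Hypothetical sketch of per-page min/max pruning: each page carries its
// own flag; a page without min/max is always scanned, other pages are
// pruned when the filter value exceeds the page max.
public class PageLevelPruning {
  static java.util.BitSet applyFilter(boolean[] pageMinMaxFlags,
                                      byte[][] pageMaxValues, byte[] filterValue) {
    java.util.BitSet pagesToScan = new java.util.BitSet(pageMinMaxFlags.length);
    for (int page = 0; page < pageMinMaxFlags.length; page++) {
      // no min/max written for this page -> cannot prune, must scan
      if (!pageMinMaxFlags[page] || isScanRequired(pageMaxValues[page], filterValue)) {
        pagesToScan.set(page);
      }
    }
    return pagesToScan;
  }

  // scan is needed when the filter value could still be <= the page max
  static boolean isScanRequired(byte[] maxValue, byte[] filterValue) {
    return compareUnsigned(filterValue, maxValue) <= 0;
  }

  // lexicographic comparison treating bytes as unsigned
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    boolean[] flags = { true, false };        // page 1 has no min/max
    byte[][] pageMax = { { 5 }, { 5 } };
    byte[] filter = { 9 };
    // page 0: filter (9) > max (5) -> pruned; page 1: no min/max -> scanned
    System.out.println(applyFilter(flags, pageMax, filter)); // {1}
  }
}
```

This keeps pruning effective for the well-behaved pages of a column even when one page (e.g. with very long string values) opted out of min/max.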
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218062043

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/filter/executer/RowLevelRangeGrtThanFiterExecuterImpl.java ---

@@ -194,9 +198,12 @@ public BitSetGroup applyFilter(RawBlockletColumnChunks rawBlockletColumnChunks,
     BitSetGroup bitSetGroup = new BitSetGroup(rawColumnChunk.getPagesCount());
     FilterExecuter filterExecuter = null;
     boolean isExclude = false;
+    boolean isMinMaxSetForFilterDimension =
+        rawBlockletColumnChunks.getDataBlock().isMinMaxSet()[chunkIndex];
     for (int i = 0; i < rawColumnChunk.getPagesCount(); i++) {
       if (rawColumnChunk.getMaxValues() != null) {
-        if (isScanRequired(rawColumnChunk.getMaxValues()[i], this.filterRangeValues)) {
+        if (!isMinMaxSetForFilterDimension || isScanRequired(rawColumnChunk.getMaxValues()[i],

--- End diff --

Please use the min/max flags from the current page and do the scan-required check.

---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218062180

--- Diff: core/src/main/java/org/apache/carbondata/core/scan/scanner/impl/BlockletFilterScanner.java ---

@@ -122,11 +122,11 @@ public boolean isScanRequired(DataRefNode dataBlock) {
       bitSet = ((ImplicitColumnFilterExecutor) filterExecuter)
           .isFilterValuesPresentInBlockOrBlocklet(
               dataBlock.getColumnsMaxValue(),
-              dataBlock.getColumnsMinValue(), blockletId);
+              dataBlock.getColumnsMinValue(), blockletId, dataBlock.isMinMaxSet());
     } else {
       bitSet = this.filterExecuter
           .isScanRequired(dataBlock.getColumnsMaxValue(),
-              dataBlock.getColumnsMinValue());
+              dataBlock.getColumnsMinValue(), dataBlock.isMinMaxSet());

--- End diff --

Please use the min/max flags from the current footer and do the scan-required check.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/487/ ---
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2725#discussion_r218066737

--- Diff: core/src/main/java/org/apache/carbondata/core/util/CarbonMetadataUtil.java ---

@@ -250,6 +275,42 @@ public static BlockletIndex getBlockletIndex(EncodedBlocklet encodedBlocklet,
   }

   /**
+   * This method will combine the writeMinMax flag from all the pages. If any page for a given
+   * dimension has the writeMinMax flag set to false then min max for that dimension will not be
+   * written in any of the pages or in the metadata
+   *
+   * @param blockletMinMaxIndex
+   * @param encodedBlocklet
+   */
+  private static List<Boolean> mergeWriteMinMaxFlagForAllPages(
+      BlockletMinMaxIndex blockletMinMaxIndex, EncodedBlocklet encodedBlocklet) {
+    Boolean[] mergedWriteMinMaxFlag =
+        new Boolean[encodedBlocklet.getNumberOfDimension() + encodedBlocklet.getNumberOfMeasure()];
+    // set writeMinMax flag to true for all the columns by default and then update if stats object
+    // has this flag set to false
+    Arrays.fill(mergedWriteMinMaxFlag, true);
+    for (int i = 0; i < encodedBlocklet.getNumberOfDimension(); i++) {
+      for (int pageIndex = 0; pageIndex < encodedBlocklet.getNumberOfPages(); pageIndex++) {
+        EncodedColumnPage encodedColumnPage =
+            encodedBlocklet.getEncodedDimensionColumnPages().get(i).getEncodedColumnPageList()
+                .get(pageIndex);
+        SimpleStatsResult stats = encodedColumnPage.getStats();
+        if (!stats.writeMinMax()) {
+          mergedWriteMinMaxFlag[i] = stats.writeMinMax();
+          String columnName = encodedColumnPage.getActualPage().getColumnSpec().getFieldName();
+          LOGGER.info("Min Max writing ignored for column " + columnName + " from page 0 to "

--- End diff --

Add a generic message saying `"Min Max writing of blocklet ignored for column with name " + columnName`

---
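The method under review effectively ANDs the per-page writeMinMax flags of a column, so one page that skipped min/max disables min/max for that column in the whole blocklet. A simplified, hypothetical sketch of that merge, using generic column indexes instead of the real dimension/measure split and `System.out` in place of the logger, with the reviewer's suggested generic log message:

```java
// Hypothetical sketch of merging per-page writeMinMax flags into one
// blocklet-level flag per column: the merged flag starts true and flips
// to false as soon as any page for that column skipped min/max.
import java.util.Arrays;

public class MergeMinMaxFlags {
  // pageFlags[column][page] -> true if that page wrote min/max for the column
  static Boolean[] merge(boolean[][] pageFlags) {
    Boolean[] merged = new Boolean[pageFlags.length];
    // default every column to true, then flip on the first false page flag
    Arrays.fill(merged, Boolean.TRUE);
    for (int col = 0; col < pageFlags.length; col++) {
      for (boolean pageFlag : pageFlags[col]) {
        if (!pageFlag) {
          merged[col] = Boolean.FALSE;
          System.out.println(
              "Min Max writing of blocklet ignored for column with index " + col);
          break;
        }
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    boolean[][] pageFlags = {
        { true, true },   // column 0: all pages wrote min/max -> kept
        { true, false },  // column 1: one page skipped -> dropped for blocklet
    };
    System.out.println(Arrays.toString(merge(pageFlags))); // [true, false]
  }
}
```

Merging to the blocklet level keeps the footer metadata consistent: a reader never sees min/max for a column unless every page of that column contributed to it.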
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2725 Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/8557/ ---