GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/2574

[HotFix][CARBONDATA-2788][BloomDataMap] Fix bugs in incorrect query result with bloom datamap

This PR solves three problems that affect the correctness of queries on the bloom datamap.

1. Optimize blockletId in rebuilding datamap
After reviewing the code, we found that the modification in PR #2539 is not needed, so we revert that PR.

2. Fix bugs in overflow for blocklet count
CarbonData stores the blocklet count of each block as a byte. When a block contains more than 127 blocklets, the count overflows the range of a byte. Here we change the data type to short.

3. Fix bug in querying with bloom datamap with block cache level enabled
With block-level cache, the main BlockDataMap previously returned the block as a pruned blocklet with blockletId = -1, whereas other index datamaps such as BloomDataMap return actual blocklets with correct blockletIds. Due to the behaviour of Blocklet's hashCode, some blocklets are incorrectly marked as duplicates and dropped, which causes incorrect query results. To fix this, we return all blocklets of the block with their correct blockletIds instead of returning a fake blocklet with blockletId = -1. This does not affect the subsequent procedure.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

 - [ ] Any interfaces changed?
 - [ ] Any backward compatibility impacted?
 - [ ] Document update required?
 - [ ] Testing done
       Please provide details on
       - Whether new unit test cases have been added or why no new tests are required?
       - How it is tested? Please attach test report.
       - Is it a performance related change? Please attach the performance test report.
       - Any additional information to help reviewers in testing this change.
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata 0728_fix_bug_query_bloom_opt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2574.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2574

----

commit cba17975affad343d555f49cb255043b556310f7
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-07-27T13:13:51Z

    Fix bugs in overflow for blocklet count

    CarbonData stores the blocklet count of each block as a byte. When a block
    contains more than 127 blocklets, the count overflows the range of a byte.
    Here we change the data type to short.

commit 7e41d9533dae86ab26e102a0342c128cbab03d1c
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-07-28T07:37:38Z

    Revert optimize blockletId in rebuilding datamap

    We found that querying a large amount of data with a rebuilt bloom datamap
    gives incorrect results. The root cause is that the blockletId in
    ResultCollector is wrong (this was introduced in PR #2539), so we revert the
    previous modification. It is now checked and works fine.

commit e418018e05cb0a15c996c5bb58debb0486252f84
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-07-28T09:08:52Z

    Fix bug in querying with bloom datamap with block cache level enabled

    With block-level cache, the main BlockDataMap previously returned the block as
    a pruned blocklet with blockletId = -1, whereas other index datamaps such as
    BloomDataMap return actual blocklets with correct blockletIds. Due to the
    behaviour of Blocklet's hashCode, some blocklets are incorrectly marked as
    duplicates and dropped, which causes incorrect query results. To fix this, we
    return all blocklets of the block with their correct blockletIds instead of
    returning a fake blocklet with blockletId = -1. This does not affect the
    subsequent procedure.

----
---
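For illustration only (this code is not part of the PR), a minimal Java sketch of bug 2: a signed byte holds only values from -128 to 127, so a blocklet count above 127 wraps around, while a short is wide enough.

    // Illustration only; simplified, not CarbonData code.
    public class BlockletCountOverflowSketch {
      public static void main(String[] args) {
        int blockletCount = 130;                 // a block with more than 127 blocklets
        byte asByte = (byte) blockletCount;      // old storage type for the count
        short asShort = (short) blockletCount;   // storage type after this PR
        System.out.println("stored as byte : " + asByte);   // prints -126 (wrapped)
        System.out.println("stored as short: " + asShort);  // prints 130
      }
    }

A similarly simplified sketch of bug 3, under the assumption (made only for illustration) that Blocklet's equals()/hashCode() ignore the blockletId: merging pruned results from two datamaps through a hash set then silently drops blocklets that differ only in blockletId, which is the duplicate-dropping behaviour described above.

    import java.util.HashSet;
    import java.util.Objects;
    import java.util.Set;

    // Illustration only; hypothetical simplified Blocklet, not the CarbonData class.
    public class SimpleBlocklet {
      final String filePath;
      final short blockletId;

      SimpleBlocklet(String filePath, short blockletId) {
        this.filePath = filePath;
        this.blockletId = blockletId;
      }

      @Override public boolean equals(Object o) {
        return o instanceof SimpleBlocklet
            && filePath.equals(((SimpleBlocklet) o).filePath);  // blockletId not compared
      }

      @Override public int hashCode() {
        return Objects.hash(filePath);                          // blockletId not hashed
      }

      public static void main(String[] args) {
        Set<SimpleBlocklet> merged = new HashSet<>();
        merged.add(new SimpleBlocklet("part-0-0.carbondata", (short) -1)); // fake block-level entry
        merged.add(new SimpleBlocklet("part-0-0.carbondata", (short) 0));  // bloom datamap result
        merged.add(new SimpleBlocklet("part-0-0.carbondata", (short) 1));  // bloom datamap result
        System.out.println(merged.size());  // prints 1: the real blocklets were dropped
      }
    }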
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2574 Hi @manishgupta88, can you check this PR? It is intended to replace PR #2565. ---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2574 Hi @ravipesala, can you also check this PR? It fixes three problems; please refer to the PR description for more information. ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2574 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7582/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2574 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6337/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2574 LGTM. @manishgupta88 please review once. ---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2574 Refactored the pruning procedure for BlockDataMap. ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2574 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6043/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2574 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7591/ ---
Github user manishgupta88 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2574#discussion_r205948964

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMap.java ---
@@ -223,17 +222,15 @@ protected short getBlockletId(DataMapRow dataMapRow) {
     return dataMapRow.getShort(BLOCKLET_ID_INDEX);
   }
-  protected ExtendedBlocklet createBlocklet(DataMapRow row, String fileName, short blockletId,
-      boolean useMinMaxForPruning) {
+  protected ExtendedBlocklet createBlocklet(DataMapRow row, String fileName, short blockletId) {
     if (isLegacyStore) {
-      return super.createBlocklet(row, fileName, blockletId, useMinMaxForPruning);
+      return super.createBlocklet(row, fileName, blockletId);
--- End diff --

The useMinMaxForPruning flag was added to handle a different scenario, where cache_level = blocklet and the filter column's min/max is not cached in the driver. In that case the blocklet-level min/max pruning needs to be done in each executor, which is identified using this flag.

---
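As a rough, hypothetical illustration of the scenario described above (simplified types, not the actual CarbonData classes): the driver can complete min/max pruning only for columns whose min/max is cached, so when the filter column is not cached it tags the pruned blocklet with a flag, and each executor then re-applies min/max pruning for that blocklet.

    import java.util.Arrays;
    import java.util.List;

    // Illustration only; hypothetical classes and column names, not CarbonData code.
    public class ExecutorMinMaxPruningSketch {
      // hypothetical carrier for the flag the comment above refers to
      static class PrunedBlocklet {
        final short blockletId;
        final boolean useMinMaxForPruning;   // true -> executor must re-check min/max
        PrunedBlocklet(short blockletId, boolean useMinMaxForPruning) {
          this.blockletId = blockletId;
          this.useMinMaxForPruning = useMinMaxForPruning;
        }
      }

      public static void main(String[] args) {
        List<String> minMaxCachedColumns = Arrays.asList("id", "date"); // cache_level = blocklet, partial min/max cache
        String filterColumn = "name";                                   // min/max not cached in the driver
        boolean useMinMaxForPruning = !minMaxCachedColumns.contains(filterColumn);
        PrunedBlocklet blocklet = new PrunedBlocklet((short) 0, useMinMaxForPruning);
        System.out.println("executor applies min/max pruning: " + blocklet.useMinMaxForPruning);
      }
    }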
Github user manishgupta88 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2574#discussion_r205948912

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockDataMap.java ---
@@ -542,48 +551,48 @@ protected void createSummaryDMStore(BlockletDataMapModel blockletDataMapModel)
     List<Blocklet> blocklets = new ArrayList<>();
     CarbonRowSchema[] schema = getFileFooterEntrySchema();
     String filePath = getFilePath();
-    int numBlocklets = 0;
-    if (filterExp == null) {
-      numBlocklets = memoryDMStore.getRowCount();
-      for (int i = 0; i < numBlocklets; i++) {
-        DataMapRow safeRow = memoryDMStore.getDataMapRow(schema, i).convertToSafeRow();
-        blocklets.add(createBlocklet(safeRow, getFileNameWithFilePath(safeRow, filePath),
-            getBlockletId(safeRow), false));
-      }
-    } else {
-      // Remove B-tree jump logic as start and end key prepared is not
-      // correct for old store scenarios
-      int startIndex = 0;
-      numBlocklets = memoryDMStore.getRowCount();
-      FilterExecuter filterExecuter = FilterUtil
-          .getFilterExecuterTree(filterExp, getSegmentProperties(), null, getMinMaxCacheColumns());
-      // flag to be used for deciding whether use min/max in executor pruning for BlockletDataMap
-      boolean useMinMaxForPruning = useMinMaxForExecutorPruning(filterExp);
-      // min and max for executor pruning
-      while (startIndex < numBlocklets) {
-        DataMapRow safeRow = memoryDMStore.getDataMapRow(schema, startIndex).convertToSafeRow();
-        String fileName = getFileNameWithFilePath(safeRow, filePath);
-        short blockletId = getBlockletId(safeRow);
-        boolean isValid =
-            addBlockBasedOnMinMaxValue(filterExecuter, getMinMaxValue(safeRow, MAX_VALUES_INDEX),
-                getMinMaxValue(safeRow, MIN_VALUES_INDEX), fileName, blockletId);
-        if (isValid) {
-          blocklets.add(createBlocklet(safeRow, fileName, blockletId, useMinMaxForPruning));
+    ByteBuffer byteBuffer = ByteBuffer.wrap(getBlockletRowCountForEachBlock());
--- End diff --

In case of a legacy store (store version <= 1.1), we do not store the blocklet count, because the index file footer does not contain the blocklet information, so this code will throw an exception.

---
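A small standalone sketch of the concern above (assumed values; not the actual BlockDataMap code): if a legacy store wrote no per-block blocklet counts, the wrapped array is empty and reading from it underflows, so a legacy-store guard is needed before touching the buffer.

    import java.nio.ByteBuffer;

    // Illustration only; simplified, not CarbonData code.
    public class LegacyStoreBlockletCountSketch {
      public static void main(String[] args) {
        byte[] blockletCountPerBlock = new byte[0];   // legacy store: no counts in the index footer
        ByteBuffer buffer = ByteBuffer.wrap(blockletCountPerBlock);
        if (buffer.hasRemaining()) {
          System.out.println("blocklet count of first block = " + buffer.getShort());
        } else {
          // calling buffer.getShort() here would throw java.nio.BufferUnderflowException,
          // so fall back to the legacy pruning path instead
          System.out.println("legacy store: blocklet counts unavailable, use legacy path");
        }
      }
    }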
Github user manishgupta88 commented on the issue:
https://github.com/apache/carbondata/pull/2574 @xuchuanyin ...it would be better to go with PR #2565, as it contains the proper code that handles these bugs. ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2574 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6346/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2574 SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6053/ ---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2574#discussion_r205995542

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockletDataMap.java ---
@@ -223,17 +222,15 @@ protected short getBlockletId(DataMapRow dataMapRow) {
     return dataMapRow.getShort(BLOCKLET_ID_INDEX);
   }
-  protected ExtendedBlocklet createBlocklet(DataMapRow row, String fileName, short blockletId,
-      boolean useMinMaxForPruning) {
+  protected ExtendedBlocklet createBlocklet(DataMapRow row, String fileName, short blockletId) {
     if (isLegacyStore) {
-      return super.createBlocklet(row, fileName, blockletId, useMinMaxForPruning);
+      return super.createBlocklet(row, fileName, blockletId);
--- End diff --

Yes, there are test failures caused by this problem.

---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2574#discussion_r205996478

--- Diff: core/src/main/java/org/apache/carbondata/core/indexstore/blockletindex/BlockDataMap.java ---
@@ -542,48 +551,48 @@ protected void createSummaryDMStore(BlockletDataMapModel blockletDataMapModel)
     List<Blocklet> blocklets = new ArrayList<>();
     CarbonRowSchema[] schema = getFileFooterEntrySchema();
     String filePath = getFilePath();
-    int numBlocklets = 0;
-    if (filterExp == null) {
-      numBlocklets = memoryDMStore.getRowCount();
-      for (int i = 0; i < numBlocklets; i++) {
-        DataMapRow safeRow = memoryDMStore.getDataMapRow(schema, i).convertToSafeRow();
-        blocklets.add(createBlocklet(safeRow, getFileNameWithFilePath(safeRow, filePath),
-            getBlockletId(safeRow), false));
-      }
-    } else {
-      // Remove B-tree jump logic as start and end key prepared is not
-      // correct for old store scenarios
-      int startIndex = 0;
-      numBlocklets = memoryDMStore.getRowCount();
-      FilterExecuter filterExecuter = FilterUtil
-          .getFilterExecuterTree(filterExp, getSegmentProperties(), null, getMinMaxCacheColumns());
-      // flag to be used for deciding whether use min/max in executor pruning for BlockletDataMap
-      boolean useMinMaxForPruning = useMinMaxForExecutorPruning(filterExp);
-      // min and max for executor pruning
-      while (startIndex < numBlocklets) {
-        DataMapRow safeRow = memoryDMStore.getDataMapRow(schema, startIndex).convertToSafeRow();
-        String fileName = getFileNameWithFilePath(safeRow, filePath);
-        short blockletId = getBlockletId(safeRow);
-        boolean isValid =
-            addBlockBasedOnMinMaxValue(filterExecuter, getMinMaxValue(safeRow, MAX_VALUES_INDEX),
-                getMinMaxValue(safeRow, MIN_VALUES_INDEX), fileName, blockletId);
-        if (isValid) {
-          blocklets.add(createBlocklet(safeRow, fileName, blockletId, useMinMaxForPruning));
+    ByteBuffer byteBuffer = ByteBuffer.wrap(getBlockletRowCountForEachBlock());
--- End diff --

For a legacy store, the pruned results of the other index datamaps are processed by 'createBlockletFromRelativeBlockletId', which also tries to get the blocklet count for each block. Will this also be a problem?

---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2574 @ravipesala @manishgupta88 As we discussed online, we will use PR #2565 and update it based on your advice, so I'll close this PR. ---
Github user xuchuanyin closed the pull request at:
https://github.com/apache/carbondata/pull/2574 ---