GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/2565

[HotFix][CARBONDATA-2788][BloomDataMap] Revert optimization for blockletId in rebuilding datamap

We found that querying huge data with a rebuilt bloom datamap gives incorrect results. The root cause is that the blockletId in the ResultCollector is wrong (this was introduced in PR2539). This PR reverts the previous modification; it has been checked and now works fine.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done. Please provide details on:
  - Whether new unit test cases have been added, or why no new tests are required?
  - How is it tested? Please attach the test report.
  - Is it a performance-related change? Please attach the performance test report.
  - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata 0726_revert_rebuild_rdd_blockletno

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2565.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2565

----

commit 8889078ea9d1328366dc27d633b3f5ebf1906322
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-07-26T15:22:58Z

    Revert optimize blockletId in rebuilding datamap

    We found that querying huge data with a rebuilt bloom datamap gives
    incorrect results. The root cause is that the blockletId in the
    ResultCollector is wrong (this was introduced in PR2539). This reverts
    the previous modification; it has been checked and now works fine.

----

---
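For context on why a wrong blockletId corrupts query results, here is a minimal, self-contained sketch of the pruning shape the diffs later in this thread discuss. These are placeholder classes, not CarbonData's actual `CarbonBloomFilter`/`Blocklet` types: each bloom shard that reports a possible hit contributes the blocklet id it recorded at build time, so if the rebuild wrote wrong ids, the scan reads the wrong blocklets.

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder for a bloom index shard; it remembers the id of the
// blocklet it was built from (the value the rebuild must get right).
class BloomShard {
  private final short blockletNo;
  BloomShard(short blockletNo) { this.blockletNo = blockletNo; }

  // stand-in for CarbonBloomFilter.membershipTest(key)
  boolean membershipTest(byte[] key) { return true; }
  short getBlockletNo() { return blockletNo; }
}

class PruneSketch {
  // If rebuild wrote a wrong blockletNo into each shard, every hit below
  // points the scan at the wrong blocklet -- hence the incorrect results.
  static List<String> prune(List<BloomShard> shards, byte[] filterValue) {
    List<String> hitBlocklets = new ArrayList<>();
    for (BloomShard shard : shards) {
      if (shard.membershipTest(filterValue)) {
        hitBlocklets.add(String.valueOf(shard.getBlockletNo()));
      }
    }
    return hitBlocklets;
  }
}
```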
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7538/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6292/ ---
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2565#discussion_r205655699

```
--- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java ---
@@ -103,7 +106,19 @@ public void init(DataMapModel dataMapModel) throws IOException {
   /**
    * init field converters for index columns
    */
-  public void initIndexColumnConverters(CarbonTable carbonTable, List<CarbonColumn> indexedColumn) {
+  public void initIndexColumnConverters(CarbonTable carbonTable, String dataMapName,
+      List<CarbonColumn> indexedColumn) {
+    String cacheLevel = MapUtils.getString(
+        carbonTable.getTableInfo().getFactTable().getTableProperties(),
+        CarbonCommonConstants.CACHE_LEVEL, CarbonCommonConstants.CACHE_LEVEL_DEFAULT_VALUE);
+    this.isBlockletCacheLevel = cacheLevel.equalsIgnoreCase("blocklet");
+    if (!this.isBlockletCacheLevel) {
+      LOGGER.warn(
+          String.format("BloomFilter datamap %s runs with cache_level=block for table %s.%s,"
+              + " which may decrease its pruning performance",
--- End diff --
```

change to `which may decrease its pruning benefit and lead to reading more data`

---
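For readers unfamiliar with the property lookup in the hunk above, a minimal self-contained sketch of the same pattern: commons-collections `MapUtils.getString` with a default, followed by a case-insensitive comparison. The constant values here are placeholders, not necessarily the ones CarbonData keeps in `CarbonCommonConstants`.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.collections.MapUtils;

public class CacheLevelCheck {
  // Placeholder constants; the real ones live in CarbonCommonConstants.
  static final String CACHE_LEVEL = "cache_level";
  static final String CACHE_LEVEL_DEFAULT_VALUE = "BLOCK";

  public static void main(String[] args) {
    Map<String, String> tableProperties = new HashMap<>();
    // falls back to the default when the property is absent
    String cacheLevel = MapUtils.getString(
        tableProperties, CACHE_LEVEL, CACHE_LEVEL_DEFAULT_VALUE);
    boolean isBlockletCacheLevel = "blocklet".equalsIgnoreCase(cacheLevel);
    System.out.println("blocklet cache level: " + isBlockletCacheLevel); // false
  }
}
```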
Github user jackylk commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2565#discussion_r205656230

```
--- Diff: integration/spark2/src/main/scala/org/apache/carbondata/datamap/IndexDataMapRebuildRDD.scala ---
@@ -357,13 +357,20 @@ class IndexDataMapRebuildRDD[K, V](
       // skip clear datamap and we will do this after rebuild
       reader.setSkipClearDataMapAtClose(true)

+      // currently blockletId in rowWithPosition is wrong, we cannot use it
--- End diff --
```

This is a bit confusing, can you rephrase it?

---
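To make the comment's rationale concrete, here is a small hypothetical sketch, not the actual RDD code; the row layout and writer interface are assumed. The idea: when the blocklet id carried inside each row cannot be trusted, the rebuild must derive the id from its own iteration state instead of reading it out of the row.

```java
import java.util.List;

class RebuildSketch {
  /** Hypothetical row: carries a (possibly wrong) blocklet id inside it. */
  static class Row {
    final Object[] values;
    final int claimedBlockletId;  // the value the comment says cannot be used
    Row(Object[] values, int claimedBlockletId) {
      this.values = values;
      this.claimedBlockletId = claimedBlockletId;
    }
  }

  /** Hypothetical index writer fed blocklet by blocklet. */
  interface IndexWriter {
    void onBlockletStart(int blockletId);
    void addRow(Object[] values);
  }

  // Number blocklets from our own iteration order instead of trusting the
  // id stored inside each row.
  static void rebuild(List<List<Row>> blocklets, IndexWriter writer) {
    int blockletId = 0;
    for (List<Row> blocklet : blocklets) {
      writer.onBlockletStart(blockletId);
      for (Row row : blocklet) {
        writer.addRow(row.values);  // row.claimedBlockletId is ignored
      }
      blockletId++;
    }
  }
}
```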
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2565#discussion_r205657316

```
--- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java ---
@@ -103,7 +106,19 @@ public void init(DataMapModel dataMapModel) throws IOException {
   /**
    * init field converters for index columns
    */
-  public void initIndexColumnConverters(CarbonTable carbonTable, List<CarbonColumn> indexedColumn) {
+  public void initIndexColumnConverters(CarbonTable carbonTable, String dataMapName,
+      List<CarbonColumn> indexedColumn) {
+    String cacheLevel = MapUtils.getString(
+        carbonTable.getTableInfo().getFactTable().getTableProperties(),
+        CarbonCommonConstants.CACHE_LEVEL, CarbonCommonConstants.CACHE_LEVEL_DEFAULT_VALUE);
+    this.isBlockletCacheLevel = cacheLevel.equalsIgnoreCase("blocklet");
+    if (!this.isBlockletCacheLevel) {
+      LOGGER.warn(
+          String.format("BloomFilter datamap %s runs with cache_level=block for table %s.%s,"
+              + " which may decrease its pruning performance",
--- End diff --
```

OK

---
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2565#discussion_r205657632

```
--- Diff: integration/spark2/src/main/scala/org/apache/carbondata/datamap/IndexDataMapRebuildRDD.scala ---
@@ -357,13 +357,20 @@ class IndexDataMapRebuildRDD[K, V](
       // skip clear datamap and we will do this after rebuild
       reader.setSkipClearDataMapAtClose(true)

+      // currently blockletId in rowWithPosition is wrong, we cannot use it
--- End diff --
```

OK

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7543/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6297/ ---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2565 retest this please ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7545/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6299/ ---
Github user manishgupta88 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2565#discussion_r205676555

```
--- Diff: datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMap.java ---
@@ -165,9 +180,11 @@ public void initIndexColumnConverters(CarbonTable carbonTable, List<CarbonColumn
     for (CarbonBloomFilter bloomFilter : bloomIndexList) {
       boolean scanRequired = bloomFilter.membershipTest(new Key(bloomQueryModel.filterValue));
       if (scanRequired) {
+        String blockletNo =
+            isBlockletCacheLevel ? String.valueOf(bloomFilter.getBlockletNo()) : "-1";
--- End diff --
```

I think the Bloom datamap should return the actual blocklet id. The old behavior should not be modified based on cache level.

---
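A tiny sketch of the two behaviors under discussion, with assumed semantics (a blocklet id of "-1" standing for "treat the whole block as matched"). The reviewer's preference is the second variant: always report the real id and let the caller coarsen it to block granularity if its cache level requires that.

```java
class BlockletIdSketch {
  // Behavior in the hunk above: hide the real id unless cache_level=blocklet.
  static String prunedId(boolean isBlockletCacheLevel, short blockletNo) {
    return isBlockletCacheLevel ? String.valueOf(blockletNo) : "-1";
  }

  // Reviewer's suggestion: always return the actual id; the caller can
  // widen it to block level itself when needed.
  static String prunedIdAlwaysActual(short blockletNo) {
    return String.valueOf(blockletNo);
  }

  public static void main(String[] args) {
    System.out.println(prunedId(false, (short) 3));      // "-1"
    System.out.println(prunedIdAlwaysActual((short) 3)); // "3"
  }
}
```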
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2565 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6020/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2565 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6021/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7567/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6322/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7573/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2565 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7575/ ---
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2565

As of now there is still a problem: for the test case added in Bloom*FunctionSuite, the explain output of the query reports a negative pruned-blocklet count, like:

```
== CarbonData Profiler ==
Table Scan on test_rcd
 - total blocklets: 1
 - filter: (city <> null and city = city40)
 - pruned by Main DataMap
    - skipped blocklets: 0
 - pruned by CG DataMap
    - name: dm_rcd
    - provider: bloomfilter
    - skipped blocklets: -1

== Physical Plan ==
*FileScan carbondata default.test_rcd[id#172,country#173,city#174,population#175,random1#176,random2#177,random3#178,random4#179,random5#180,random6#181,random7#182,random8#183,random9#184,random10#185,random11#186,random12#187] PushedFilters: [IsNotNull(city), EqualTo(city,city40)]
```

---
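One possible reading of why the count goes negative, under the assumption that the profiler computes skipped = blocklets before CG pruning minus blocklets after: if the bloom datamap returns one "-1" placeholder hit per matching index shard (see the ternary in the diff above), the "after" count can exceed the "before" count. A minimal arithmetic sketch of that assumption:

```java
class ExplainSkipSketch {
  public static void main(String[] args) {
    // Assumed profiler arithmetic: skipped = before - after.
    int blockletsBeforeCgPrune = 1;   // "total blocklets: 1" in the output above
    // With cache_level=block, each matching bloom shard can contribute a
    // "-1" placeholder hit, so "after" may exceed "before".
    int blockletsAfterCgPrune = 2;
    int skipped = blockletsBeforeCgPrune - blockletsAfterCgPrune;
    System.out.println("skipped blocklets: " + skipped); // -1, as reported
  }
}
```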